====== Lab 2 ======
**Try to complete all of the problems.**
If you can't manage to complete a particular problem please hand in your incomplete code -- comment it out if it produces an error.

===== 1. Tokenization with regular expressions =====

You could wrap this code up in a helper function that also prints out the surrounding words of the mismatch in both lists, as sketched below.
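A minimal sketch of such a helper (the function name, the ''window'' parameter, and the printout format are all just illustrative choices):

<code python>
def print_mismatch_context(tokens_a, tokens_b, window=3):
    # walk the two token lists in parallel and report the first difference
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            print('mismatch at index', i)
            print('list a:', tokens_a[max(0, i - window):i + window + 1])
            print('list b:', tokens_b[max(0, i - window):i + window + 1])
            return
    print('no mismatch in the overlapping part of the lists')
</code>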
**Is there anything you thought should have been handled differently in the tokenization process?**
===== 2. Regular Expressions to match words =====
The function ''re.match'' returns a match object, or ''None'' if the pattern does not match the string.
It is possible to check the truth value of the object:
<code python>
import re

m = re.match(r'[a-z]+', 'hello')
if m:                  # a match object is truthy
    print(m.group())   # -> 'hello'

m = re.match(r'[0-9]+', 'hello')
if not m:              # no match: re.match returned None, which is falsy
    print('no match')
</code>
Now let's use ''re.match'' to examine the words in the list of tokens from part 1.
We can, for example, use it to try to find all mentions of years in the text.
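A small sketch of that (assuming ''tokens'' holds the token list from part 1, and taking four digits in a row as a rough approximation of a year):

<code python>
import re

# re.match anchors the pattern at the start of the string only,
# so '$' is added to keep longer digit runs from matching
years = [t for t in tokens if re.match(r'[0-9]{4}$', t)]
print(years)
</code>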
(Problem 25 in chapter 2: [[http://www.nltk.org/book/ch02.html]])
+ | |||
+ | ===== Possible Solutions ===== | ||
+ | |||
+ | <code python> | ||
#1
\w+|[^\w\s]+

#2
^[A-Z]+$
^\d\D+$

#3
import nltk

def bigrams_from_tokens(token_list):
    bigram_fd = nltk.FreqDist()

    # count the first token as a bigram with a start-of-text marker
    # (the '<s>' marker here is an assumption; any placeholder works)
    bigram_fd['<s> ' + token_list[0]] += 1

    for i in range(1, len(token_list)):
        bigram_fd[token_list[i-1] + ' ' + token_list[i]] += 1

    return bigram_fd

#This problem can be solved in numerous other ways. E.g. using the bigram function mentioned in chapter
#one to create a list of bigram tuples that are then fed into the FreqDist, or simply used as keys.
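# A sketch of that alternative (nltk.bigrams yields the bigram tuples):
#bigram_fd = nltk.FreqDist(' '.join(bg) for bg in nltk.bigrams(token_list))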
+ | |||
+ | #4 | ||
+ | names = nltk.corpus.names | ||
+ | |||
+ | cfd = nltk.ConditionalFreqDist() | ||
+ | for w in names.words(' | ||
+ | cfd[' | ||
+ | for w in names.words(' | ||
+ | cfd[' | ||
+ | |||
+ | #one liner version (creates a list of tuples (gender, initial) that the cfd is initialized with): | ||
+ | # | ||
+ | # for file in names.fileids() | ||
+ | # for name in names.words(file)) | ||
+ | |||
+ | #output the results | ||
+ | # | ||
+ | #cfd.plot() | ||
+ | |||
#5

def find_language(word):
    latin1_langs = [w for w in nltk.corpus.udhr.fileids() if w.endswith('-Latin1')]
    return [l[:-7] for l in latin1_langs if word in nltk.corpus.udhr.words(l)]

#>>> find_language(...)
+ | |||
+ | </ |