public:t-malv-15-3:2

Differences

This shows you the differences between two versions of the page.

Previous revision: public:t-malv-15-3:2 [2015/08/27 12:31] – [2. Regular Expressions to match words] orvark
Current revision: public:t-malv-15-3:2 [2024/04/29 13:33] – external edit 127.0.0.1

Line 1:

====== Lab 2 ======
- **Complete
+ **Try to complete

If you can't manage to complete a particular problem, please hand in your incomplete code -- comment it out if it produces an error.
Line 47:

You could wrap this code up in a helper function that also prints out the surrounding words of the mismatch in both lists (e.g. ''
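A minimal sketch of such a helper, assuming both inputs are plain Python lists of token strings (the function name and the window size are illustrative, not from the lab handout):

```python
# Illustrative helper: report the first index where two token lists
# diverge, printing a few surrounding tokens from each list.
def first_mismatch(tokens_a, tokens_b, window=3):
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            print('mismatch at index %d: %r vs %r' % (i, a, b))
            print('context A:', tokens_a[max(0, i - window):i + window + 1])
            print('context B:', tokens_b[max(0, i - window):i + window + 1])
            return i
    return None  # no mismatch in the overlapping range
```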
- **Is there anything you thought should have handled differently in the tokenization process?**
+ **Is there anything you thought should have been handled differently in the tokenization process?**
===== 2. Regular Expressions to match words =====

Line 78:

</code>

- Now lets use re.match to examine the words in the list of tokens in part 1.
+ Now let's use ''re.match'' to examine the words in the list of tokens in part 1.

We can for example use it to try to find all mentions of years in the text.
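For example (the four-digit pattern below is an assumption about what counts as a year, not necessarily the handout's exact pattern):

```python
import re

tokens = ['The', 'year', '1984', 'and', 'also', '2001', ',', 'but', 'not', '19']

# re.match anchors at the start of the string; the trailing $ makes this
# a whole-token match, so '19' or '1984-ish' would not count as a year.
years = [t for t in tokens if re.match(r'^\d{4}$', t)]
print(years)  # ['1984', '2001']
```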
Line 168:

(Problem 25 in chapter 2: [[http://
| + | |||
| + | ===== Possible Solutions ===== | ||
| + | |||
| + | <code python> | ||
| + | #1 | ||
| + | \w+|[^\w\s]+ | ||
| + | |||
| + | #2 | ||
| + | ^[A-Z]+$ | ||
| + | ^\d\D+$ | ||
| + | |||
| + | #3 | ||
| + | def bigrams_from_tokens( token_list): | ||
| + | bigram_fd = nltk.FreqDist() | ||
| + | |||
| + | bigram_fd[' | ||
| + | |||
| + | for i in range(1, len(token_list)): | ||
| + | bigram_fd[token_list[i-1] + ' ' + token_list[i]] += 1 | ||
| + | |||
| + | return bigram_fd | ||
| + | |||
| + | #This problem can be solved in numerous other ways. E.g. using the bigram function mentioned in chapter | ||
| + | #one to create a list of bigram tuples that are then fed into the FreqDist, or simply used as keys. | ||
| + | |||
| + | #4 | ||
| + | names = nltk.corpus.names | ||
| + | |||
| + | cfd = nltk.ConditionalFreqDist() | ||
| + | for w in names.words(' | ||
| + | cfd[' | ||
| + | for w in names.words(' | ||
| + | cfd[' | ||
| + | |||
| + | #one liner version (creates a list of tuples (gender, initial) that the cfd is initialized with): | ||
| + | # | ||
| + | # for file in names.fileids() | ||
| + | # for name in names.words(file)) | ||
| + | |||
| + | #output the results | ||
| + | # | ||
| + | #cfd.plot() | ||
| + | |||
| + | #5 | ||
| + | |||
| + | def find_language(word): | ||
| + | latin1_langs = [w for w in nltk.corpus.udhr.fileids() if w.endswith(' | ||
| + | return [l[:-7] for l in latin1_langs if word in nltk.corpus.udhr.words(l)] | ||
| + | |||
| + | #>>> | ||
| + | # | ||
| + | |||
| + | </ | ||
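The patterns in #1 and #2 can be sanity-checked without loading any corpora; a small illustrative demo (not part of the original solutions):

```python
import re

# Pattern #1 splits text into word-character runs or runs of other
# non-space characters, so punctuation becomes separate tokens.
print(re.findall(r"\w+|[^\w\s]+", "Don't panic!"))  # ['Don', "'", 't', 'panic', '!']

# Pattern #2 relies on re.match plus ^...$ anchors for whole-token matches.
assert re.match(r'^[A-Z]+$', 'NLTK') is not None   # all upper-case letters
assert re.match(r'^[A-Z]+$', 'Nltk') is None       # mixed case fails
assert re.match(r'^\d\D+$', '3rd') is not None     # one digit, then non-digits
assert re.match(r'^\d\D+$', '33rd') is None        # second digit breaks \D+
```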