====== Lab 2 ======
**Try to complete all of the problems.**
If you can't manage to complete a particular problem please hand in your incomplete code -- comment it out if it produces an error.

===== 1. Tokenization with regular expressions =====

You could wrap this code up in a helper function that also prints out the surrounding words of the mismatch in both lists, as sketched below.
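A minimal sketch of such a helper (the function name, the ''window'' parameter, and the printout format are all just illustrative choices):

<code python>
def print_mismatch_context(tokens_a, tokens_b, window=3):
    # walk the two token lists in parallel and report the first difference
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            print('mismatch at index', i)
            print('list a:', tokens_a[max(0, i - window):i + window + 1])
            print('list b:', tokens_b[max(0, i - window):i + window + 1])
            return
    print('no mismatch in the overlapping part of the lists')
</code>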
**Is there anything you thought should have been handled differently in the tokenization process?**
===== 2. Regular Expressions to match words =====
The function ''re.match'' returns a match object, or ''None'' if the pattern does not match the string.
It is possible to check the truth value of the object:
<code python>
import re

m = re.match(r'[a-z]+', 'hello')
if m:                  # a match object is truthy
    print(m.group())   # -> 'hello'

m = re.match(r'[0-9]+', 'hello')
if not m:              # no match: re.match returned None, which is falsy
    print('no match')
</code>
Now let's use ''re.match'' to examine the words in the list of tokens from part 1.
We can, for example, use it to try to find all mentions of years in the text.
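A small sketch of that (assuming ''tokens'' holds the token list from part 1, and taking four digits in a row as a rough approximation of a year):

<code python>
import re

# re.match anchors the pattern at the start of the string only,
# so '$' is added to keep longer digit runs from matching
years = [t for t in tokens if re.match(r'[0-9]{4}$', t)]
print(years)
</code>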
(Problem 25 in chapter 2: [[http://www.nltk.org/book/ch02.html]])
+ | |||
+ | ===== Possible Solutions ===== | ||
+ | |||
+ | <code python> | ||
#1
\w+|[^\w\s]+

#2
^[A-Z]+$
^\d\D+$

#3
import nltk

def bigrams_from_tokens(token_list):
    bigram_fd = nltk.FreqDist()

    # count the first token as a bigram with a start-of-text marker
    # (the '<s>' marker here is an assumption; any placeholder works)
    bigram_fd['<s> ' + token_list[0]] += 1

    for i in range(1, len(token_list)):
        bigram_fd[token_list[i-1] + ' ' + token_list[i]] += 1

    return bigram_fd

#This problem can be solved in numerous other ways. E.g. using the bigram function mentioned in chapter
#one to create a list of bigram tuples that are then fed into the FreqDist, or simply used as keys.
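# A sketch of that alternative (nltk.bigrams yields the bigram tuples):
#bigram_fd = nltk.FreqDist(' '.join(bg) for bg in nltk.bigrams(token_list))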
+ | |||
+ | #4 | ||
+ | names = nltk.corpus.names | ||
+ | |||
+ | cfd = nltk.ConditionalFreqDist() | ||
+ | for w in names.words(' | ||
+ | cfd[' | ||
+ | for w in names.words(' | ||
+ | cfd[' | ||
+ | |||
+ | #one liner version (creates a list of tuples (gender, initial) that the cfd is initialized with): | ||
+ | # | ||
+ | # for file in names.fileids() | ||
+ | # for name in names.words(file)) | ||
+ | |||
+ | #output the results | ||
+ | # | ||
+ | #cfd.plot() | ||
+ | |||
#5

def find_language(word):
    latin1_langs = [w for w in nltk.corpus.udhr.fileids() if w.endswith('-Latin1')]
    return [l[:-7] for l in latin1_langs if word in nltk.corpus.udhr.words(l)]

#>>> find_language(...)
+ | |||
+ | </ |