public:t-malv-15-3:4
Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| public:t-malv-15-3:4 [2015/09/10 02:38] – created orvark | public:t-malv-15-3:4 [2024/04/29 13:33] (current) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 19: | Line 19: | ||
| * Create a Bigram Collocation Finder for the Brown Corpus. | * Create a Bigram Collocation Finder for the Brown Corpus. | ||
| * Apply a filter to remove bigrams that occur fewer than two times. | * Apply a filter to remove bigrams that occur fewer than two times. | ||
| - | * Apply a filter to remove stopwords (' | + | * Apply a filter to remove |
| - | * Print out the 20 most frequent bigrams. | + | * Print out the 20 most **frequent bigrams**. |
| REPEAT THIS FOR BOTH: | REPEAT THIS FOR BOTH: | ||
| - | * Bigrams using a window of size 3 | + | * Bigrams using a window of size 3. |
| - | * Trigrams | + | * Trigrams. |
| - | ===== 2. Trying out different | + | ===== 2. Trying out different |
| TODO: | TODO: | ||
| * Try out some other measure functions, at least '' | * Try out some other measure functions, at least '' | ||
| - | * Print out the top 20 bigrams using each measure | + | * Print out the top 20 bigrams using each of the selected |
| - | * You can use '' | + | * You can use '' |
| [[http:// | [[http:// | ||
| Line 42: | Line 42: | ||
| * Create a Bigram Collocation Finder for the tagged version of the Brown Corpus, **using only the tags**. | * Create a Bigram Collocation Finder for the tagged version of the Brown Corpus, **using only the tags**. | ||
| - | * Print out the 20 most frequent | + | * Print out the 20 most frequent |
| - | * Do you think any other frequency | + | * Do you think any other association |
| ===== 4. Correlations between association measure functions ===== | ===== 4. Correlations between association measure functions ===== | ||
| Line 54: | Line 54: | ||
| print(' | print(' | ||
| ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.pmi)), | ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.pmi)), | ||
| - | ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.raw_freq))))) | + | ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.raw_freq)))) |
| </ | </ | ||
| Line 62: | Line 62: | ||
| ===== 5. Working with Bigram scores | ===== 5. Working with Bigram scores | ||
| - | The following code snippet shows how you can use '' | + | The following code snippet shows an example of how you can use '' |
| <code python> | <code python> | ||
| Line 74: | Line 74: | ||
| nltk.collocations.BigramAssocMeasures.likelihood_ratio) | nltk.collocations.BigramAssocMeasures.likelihood_ratio) | ||
| - | #group by first word in bigram | + | #create a defaultdict of lists |
| prev_word = defaultdict(list) | prev_word = defaultdict(list) | ||
| + | |||
| + | #group by first word in bigram | ||
| for key, scores in scored: | for key, scores in scored: | ||
| | | ||
| - | #sort by strongest | + | #sort each list by highest |
| for key in prev_word: | for key in prev_word: | ||
| | | ||
| Line 90: | Line 92: | ||
| what_comes_after(' | what_comes_after(' | ||
| what_comes_after(' | what_comes_after(' | ||
| + | </ | ||
| + | |||
| + | FYI: a normal Python dictionary raises a '' | ||
| + | |||
| + | ===== Solutions | ||
| + | |||
| + | <code python> | ||
| + | import nltk | ||
| + | from nltk.collocations import * | ||
| + | from nltk.metrics import BigramAssocMeasures, | ||
| + | from nltk.corpus import brown, stopwords | ||
| + | |||
| + | #1 | ||
| + | |||
| + | bam = BigramAssocMeasures | ||
| + | |||
| + | corpus = brown.words() | ||
| + | |||
| + | finder = BigramCollocationFinder.from_words(corpus) | ||
| + | |||
| + | word_filter = lambda w: len(w) < 3 or w.lower() in stopwords.words(' | ||
| + | #def word_filter(w): | ||
| + | |||
| + | |||
| + | finder.apply_freq_filter(2) | ||
| + | finder.apply_word_filter(word_filter) | ||
| + | |||
| + | print(finder.nbest(bam.raw_freq, | ||
| + | |||
| + | |||
| + | finder_win3 = BigramCollocationFinder.from_words(corpus, | ||
| + | finder_win3.apply_freq_filter(2) | ||
| + | finder_win3.apply_word_filter(word_filter) | ||
| + | print(finder_win3.nbest(bam.raw_freq, | ||
| + | |||
| + | |||
| + | tam = TrigramAssocMeasures | ||
| + | |||
| + | finder_tri = TrigramCollocationFinder.from_words(corpus) | ||
| + | finder_tri.apply_freq_filter(2) | ||
| + | finder_tri.apply_word_filter(word_filter) | ||
| + | print(finder_tri.nbest(tam.raw_freq, | ||
| + | |||
| + | #2 | ||
| + | |||
| + | # Pointwise mutual information | ||
| + | print(finder.nbest(bam.pmi, | ||
| + | # Log-likelihood ratio | ||
| + | print(finder.nbest(bam.likelihood_ratio, | ||
| + | # Mutual information likelihood, a variant of mutual information | ||
| + | print(finder.nbest(bam.mi_like, | ||
| + | # Chi squared test | ||
| + | print(finder.nbest(bam.chi_sq, | ||
| + | # Student' | ||
| + | print(finder.nbest(bam.student_t, | ||
| + | |||
| + | #3 | ||
| + | |||
| + | tagged_corpus = brown.tagged_words(tagset=' | ||
| + | |||
| + | finder_tagged = BigramCollocationFinder.from_words(tagged_corpus) | ||
| + | print(finder_tagged.nbest(bam.raw_freq, | ||
| + | |||
| + | finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus) | ||
| + | print(finder_tags.nbest(bam.raw_freq, | ||
| </ | </ | ||
/var/www/cadia.ru.is/wiki/data/attic/public/t-malv-15-3/4.1441852697.txt.gz · Last modified: 2024/04/29 13:32 (external edit)