====== Lab 4 ======

See the [[http://www.nltk.org/howto/collocations.html|NLTK Collocations howto]] for help.

===== 1. Bi- and Trigram Collocation finders =====

import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords
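Before starting on the tasks, it can help to see the finder API on a tiny input. The following is only a sketch: the token list is made up for illustration and stands in for ''brown.words()'', so the numbers mean nothing beyond showing how ''from_words'', ''apply_freq_filter'' and ''nbest'' fit together.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy token list (an assumption for illustration, not part of the lab data).
tokens = "the cat sat on the mat and the cat ate the fish".split()

# Count all adjacent word pairs in the token list.
finder = BigramCollocationFinder.from_words(tokens)

# Drop every bigram that occurs fewer than 2 times.
finder.apply_freq_filter(2)

# Rank the remaining bigrams by raw frequency and keep at most 5.
top = finder.nbest(BigramAssocMeasures.raw_freq, 5)
print(top)
```

Only ''('the', 'cat')'' occurs twice in the toy list, so it is the only bigram left after the frequency filter.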

TODO:
* Create a Bigram Collocation Finder for the Brown Corpus.
* Apply a filter to remove bigrams that occur fewer than two times.
* Apply a filter to remove bigrams that contain stopwords (''stopwords.words('english')'') and words that are two characters or shorter.
* Print out the 20 most **frequent bigrams**.

REPEAT THIS FOR BOTH:
* Bigrams using a window of size 3.
* Trigrams.

===== 2. Trying out different association measure functions =====

TODO:
* Try out some other measure functions, at least ''pmi'', ''likelihood_ratio'', ''mi_like'', ''chi_sq''.
* Print out the top 20 bigrams using each of the selected measure functions.
* You can use ''help(BigramAssocMeasures.pmi)'' etc. and Google to try to understand the difference between them. [[http://www.nltk.org/_modules/nltk/metrics/association.html]]

===== 3. Bigrams and tagged corpora =====

TODO:
* Create a Bigram Collocation Finder for the tagged version of the Brown Corpus.
* Print out the 20 most frequent word/tag bigrams.
* Create a Bigram Collocation Finder for the tagged version of the Brown Corpus, **using only the tags**.
* Print out the 20 most frequent tag/tag bigrams.
* Do you think any association measures other than raw frequency might be useful here?

===== 4. Correlations between association measure functions =====

The Spearman correlation coefficient (1 = identical ranking, -1 = reversed ranking) can be used to compare the different association measures for a corpus or text, e.g. compare ''pmi'' against the ''raw_freq'' of the corpus. It can also come in handy when trying to understand the difference between the available association measures, e.g. compare ''likelihood_ratio'' and ''mi_like'' etc.

finder = [...]
print('Correlation: %0.3f' % spearman_correlation(
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.pmi)),
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.raw_freq))))

TODO:
* Compare different association measures and see which are similar and which are farthest from the raw frequency.

===== 5. Working with Bigram scores =====

The following code snippet shows an example of how you can use ''score_ngrams'' to get the bigram scores to work with them directly. Feel free to play around with it or try to improve on the code.

import nltk.collocations
from nltk.corpus import brown
from collections import defaultdict

finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(
    nltk.collocations.BigramAssocMeasures.likelihood_ratio)

#create a defaultdict of lists
prev_word = defaultdict(list)

#group by first word in bigram
for key, scores in scored:
    prev_word[key[0]].append((key[1], scores))

#sort each list by highest association measure
for key in prev_word:
    prev_word[key].sort(key = lambda x: -x[1])

def what_comes_after(word, num):
    print(word.upper(), ': ', [prev_word[word][:num]])

#wait for it ...
what_comes_after('strong', 10)
what_comes_after('powerful', 10)
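One thing to watch for in the snippet above: ''what_comes_after'' indexes ''prev_word'' directly, so asking about a word that never occurs in the corpus silently adds an empty list for it. A minimal sketch of that behaviour using only the standard library; the keys and scores here are made up for illustration:

```python
from collections import defaultdict

# Hypothetical toy data standing in for the grouped bigram scores.
plain = {'strong': [('tea', 1.0)]}
auto = defaultdict(list, plain)

# A plain dict raises KeyError for a missing key.
try:
    plain['powerful']
    raised = False
except KeyError:
    raised = True

# A defaultdict creates the missing entry as a side effect of the lookup.
before = len(auto)
auto['powerful']
after = len(auto)

# .get() looks up a key without inserting it.
safe = auto.get('mighty', [])

print(raised, before, after, 'mighty' in auto)
```

If the side effect is unwanted, ''prev_word.get(word, [])'' is one way to read without creating new entries.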

FYI: a normal Python dictionary raises a ''KeyError'' if you try to get an item with a key that is not currently in the dictionary. The ''defaultdict'', in contrast, will simply create any item that you try to access if it doesn't exist. See [[http://www.nltk.org/book/ch05.html|section 3.4 in chapter 5 in the NLTK book]].

===== Solutions =====

import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords

#1
bam = BigramAssocMeasures
corpus = brown.words() # ok to use a subset e.g. 'ca01' for testing
finder = BigramCollocationFinder.from_words(corpus)
word_filter = lambda w: len(w) < 3 or w.lower() in stopwords.words('english')
#def word_filter(w): return len(w) < 3 or w.lower() in stopwords.words('english')
finder.apply_freq_filter(2)
finder.apply_word_filter(word_filter)
print(finder.nbest(bam.raw_freq, 20))

finder_win3 = BigramCollocationFinder.from_words(corpus, window_size=3)
finder_win3.apply_freq_filter(2)
finder_win3.apply_word_filter(word_filter)
print(finder_win3.nbest(bam.raw_freq, 20))

tam = TrigramAssocMeasures
finder_tri = TrigramCollocationFinder.from_words(corpus)
finder_tri.apply_freq_filter(2)
finder_tri.apply_word_filter(word_filter)
print(finder_tri.nbest(tam.raw_freq, 20))

#2
# Pointwise mutual information
print(finder.nbest(bam.pmi, 20))
# Log-likelihood ratio
print(finder.nbest(bam.likelihood_ratio, 20))
# Mutual information likelihood, a mi variant
print(finder.nbest(bam.mi_like, 20))
# Chi squared test
print(finder.nbest(bam.chi_sq, 20))
# Student's t-test, w/independence hypothesis for unigrams
print(finder.nbest(bam.student_t, 20))

#3
tagged_corpus = brown.tagged_words(tagset='universal') # ok to use a subset e.g. 'ca01' for testing
finder_tagged = BigramCollocationFinder.from_words(tagged_corpus)
print(finder_tagged.nbest(bam.raw_freq, 20))

finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus)
print(finder_tags.nbest(bam.raw_freq, 20))
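The solutions above stop at task 3. For task 4, one possible approach is sketched below. It is not an official solution: the helper ''correlation'' and the toy token list are assumptions introduced for illustration, and in the lab the finder would be built from ''brown.words()'' instead.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import (BigramAssocMeasures, spearman_correlation,
                          ranks_from_scores)

# Toy token list (made up for illustration); use brown.words() in the lab.
tokens = ("strong tea and strong coffee but powerful computers and "
          "powerful engines need strong support and strong tea").split()

finder = BigramCollocationFinder.from_words(tokens)
bam = BigramAssocMeasures


def correlation(finder, measure_a, measure_b):
    """Spearman correlation between the rankings induced by two measures.

    Hypothetical helper, not part of NLTK.
    """
    return spearman_correlation(
        ranks_from_scores(finder.score_ngrams(measure_a)),
        ranks_from_scores(finder.score_ngrams(measure_b)))


# Measures to compare against raw frequency.
measures = {
    'pmi': bam.pmi,
    'likelihood_ratio': bam.likelihood_ratio,
    'mi_like': bam.mi_like,
    'student_t': bam.student_t,
}

results = {name: correlation(finder, fn, bam.raw_freq)
           for name, fn in measures.items()}

# Print from least to most similar to the raw-frequency ranking.
for name, rho in sorted(results.items(), key=lambda item: item[1]):
    print('%-18s vs raw_freq: %0.3f' % (name, rho))
```

Sorting by the coefficient puts the measures whose ranking differs most from raw frequency first, which is what the TODO in section 4 asks about. On a toy list this small the values are not meaningful; run it on the Brown Corpus to draw conclusions.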