Try to complete as many problems as you can. Hand in a Python code file (fullName_lab4.py) with what you have finished in MySchool before midnight today (10 September).
If you can't manage to complete a particular problem, please hand in your incomplete code – comment it out if it produces an error.
See the NLTK Collocations howto for help.
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords
TODO:
Find the most frequent bigrams in the Brown corpus. Filter out stopwords (stopwords.words('english')) and words that are two characters or shorter. Repeat this for both bigrams found within a window of three words and for trigrams.
TODO:
Try the different association measures: pmi, likelihood_ratio, mi_like and chi_sq. Use help(BigramAssocMeasures.pmi) etc. and google to try to understand the difference between them.

TODO:
The Spearman correlation coefficient (1 = the same ranking, -1 = the opposite) can be used to compare the different association measures for a corpus or text, e.g. compare pmi against the raw_freq of the corpus. It can also come in handy when trying to understand the difference between the available association measures, e.g. compare likelihood_ratio and mi_like etc.
finder = [...]
print('Correlation: %0.3f' % spearman_correlation(
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.pmi)),
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.raw_freq))))
TODO:
What do you get if you use the tagged Brown corpus (brown.tagged_words(tagset='universal')) instead? Try finding the most frequent bigrams both for the (word, tag) pairs and for the tags alone.
The following code snippet shows an example of how you can use score_ngrams to get the bigram scores and work with them directly. Feel free to play around with it or try to improve on the code. Note that this is not a problem to be solved, so there is no need to hand anything in.
import nltk.collocations
from nltk.corpus import brown
from collections import defaultdict

finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(
    nltk.collocations.BigramAssocMeasures.likelihood_ratio)

# create a defaultdict of lists
prev_word = defaultdict(list)

# group by first word in bigram
for key, scores in scored:
    prev_word[key[0]].append((key[1], scores))

# sort each list by highest association measure
for key in prev_word:
    prev_word[key].sort(key=lambda x: -x[1])

def what_comes_after(word, num):
    print(word.upper(), ': ', [prev_word[word][:num]])

# wait for it ...
what_comes_after('strong', 10)
what_comes_after('powerful', 10)
FYI: a normal Python dictionary throws a KeyError if you try to get an item with a key that is not currently in the dictionary. The defaultdict, in contrast, will simply create any item that you try to access if it doesn't exist. See section 3.4 in chapter 5 of the NLTK book.
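The difference is easy to see in isolation (a toy example, not part of the lab):

```python
from collections import defaultdict

plain = {}
grouped = defaultdict(list)  # missing keys default to a fresh empty list

grouped['strong'].append(('tea', 4.2))  # key created on first access
print(grouped['strong'])                # [('tea', 4.2)]
print(grouped['never_seen'])            # [] -- created empty, no KeyError

try:
    plain['never_seen']                 # a normal dict has no default
except KeyError:
    print('plain dict raises KeyError')
```

Note that merely reading grouped['never_seen'] inserts the key, which is exactly why the snippet above can append to prev_word[key[0]] without checking whether the key exists first.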
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords

#1
bam = BigramAssocMeasures
corpus = brown.words()  # ok to use a subset e.g. 'ca01' for testing
finder = BigramCollocationFinder.from_words(corpus)
word_filter = lambda w: len(w) < 3 or w.lower() in stopwords.words('english')
#def word_filter(w): return len(w) < 3 or w.lower() in stopwords.words('english')
finder.apply_freq_filter(2)
finder.apply_word_filter(word_filter)
print(finder.nbest(bam.raw_freq, 20))

finder_win3 = BigramCollocationFinder.from_words(corpus, window_size=3)
finder_win3.apply_freq_filter(2)
finder_win3.apply_word_filter(word_filter)
print(finder_win3.nbest(bam.raw_freq, 20))

tam = TrigramAssocMeasures
finder_tri = TrigramCollocationFinder.from_words(corpus)
finder_tri.apply_freq_filter(2)
finder_tri.apply_word_filter(word_filter)
print(finder_tri.nbest(tam.raw_freq, 20))

#2
# Pointwise mutual information
print(finder.nbest(bam.pmi, 20))
# Log-likelihood ratio
print(finder.nbest(bam.likelihood_ratio, 20))
# Mutual information likelihood, an MI variant
print(finder.nbest(bam.mi_like, 20))
# Chi-squared test
print(finder.nbest(bam.chi_sq, 20))
# Student's t-test, w/ independence hypothesis for unigrams
print(finder.nbest(bam.student_t, 20))

#3
tagged_corpus = brown.tagged_words(tagset='universal')  # ok to use a subset e.g. 'ca01' for testing
finder_tagged = BigramCollocationFinder.from_words(tagged_corpus)
print(finder_tagged.nbest(bam.raw_freq, 20))
finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus)
print(finder_tags.nbest(bam.raw_freq, 20))