====== Lab 4 ======

**Try to complete as many problems as you can. Hand in a Python code file (''fullName_lab4.py'') with what you have finished in MySchool before midnight today (10 September).** If you can't manage to complete a particular problem, please hand in your incomplete code -- comment it out if it produces an error.

See the [[http://www.nltk.org/howto/collocations.html|NLTK Collocations howto]] for help.

===== 1. Bi- and Trigram Collocation finders =====

<code python>
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords
</code>

TODO:
  * Create a Bigram Collocation Finder for the Brown Corpus.
  * Apply a filter to remove bigrams that occur fewer than two times.
  * Apply a filter to remove bigrams that contain stopwords (''stopwords.words('english')'') or words that are two characters or shorter.
  * Print out the 20 most **frequent bigrams**.

REPEAT THIS FOR BOTH:
  * Bigrams using a window of size 3.
  * Trigrams.

===== 2. Trying out different association measure functions =====

TODO:
  * Try out some other measure functions, at least ''pmi'', ''likelihood_ratio'', ''mi_like'' and ''chi_sq''.
  * Print out the top 20 bigrams using each of the selected measure functions.
  * You can use ''help(BigramAssocMeasures.pmi)'' etc. and Google to try to understand the differences between them: [[http://www.nltk.org/_modules/nltk/metrics/association.html]]

===== 3. Bigrams and tagged corpora =====

TODO:
  * Create a Bigram Collocation Finder for the tagged version of the Brown Corpus.
  * Print out the 20 most frequent word/tag bigrams.
  * Create a Bigram Collocation Finder for the tagged version of the Brown Corpus, **using only the tags**.
  * Print out the 20 most frequent tag/tag bigrams.
  * Do you think any association measures other than raw frequency might be useful here?

===== 4. Correlations between association measure functions =====

The Spearman correlation coefficient (1 = identical ranking, -1 = opposite ranking) can be used to compare the different association measures for a corpus or text, e.g. compare ''pmi'' against the ''raw_freq'' of the corpus. It can also come in handy when trying to understand the differences between the available association measures, e.g. compare ''likelihood_ratio'' and ''mi_like'' etc.

<code python>
finder = [...]
print('Correlation: %0.3f' % spearman_correlation(
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.pmi)),
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.raw_freq))))
</code>

TODO:
  * Compare different association measures and see which are similar and which are farthest from the raw frequency.

===== 5. Working with Bigram scores =====

The following code snippet shows an example of how you can use ''score_ngrams'' to get the bigram scores and work with them directly. Feel free to play around with it or try to improve on the code. **Note that this is not a problem to be solved, so there is no need to hand anything in.**

<code python>
import nltk.collocations
from nltk.corpus import brown
from collections import defaultdict

finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(
    nltk.collocations.BigramAssocMeasures.likelihood_ratio)

# create a defaultdict of lists
prev_word = defaultdict(list)

# group by first word in bigram
for key, scores in scored:
    prev_word[key[0]].append((key[1], scores))

# sort each list by highest association measure
for key in prev_word:
    prev_word[key].sort(key=lambda x: -x[1])

def what_comes_after(word, num):
    print(word.upper(), ':', prev_word[word][:num])

# wait for it ...
what_comes_after('strong', 10)
what_comes_after('powerful', 10)
</code>

FYI: a normal Python dictionary throws a ''KeyError'' if you try to get an item with a key that is not currently in the dictionary.
The ''defaultdict'', in contrast, will simply create a default value for any key you try to access that doesn't exist yet. See [[http://www.nltk.org/book/ch05.html|section 3.4 in chapter 5 of the NLTK book]].

===== Solutions =====

<code python>
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords

#1
bam = BigramAssocMeasures
corpus = brown.words()  # ok to use a subset, e.g. 'ca01', for testing

finder = BigramCollocationFinder.from_words(corpus)
word_filter = lambda w: len(w) < 3 or w.lower() in stopwords.words('english')
# def word_filter(w): return len(w) < 3 or w.lower() in stopwords.words('english')
finder.apply_freq_filter(2)
finder.apply_word_filter(word_filter)
print(finder.nbest(bam.raw_freq, 20))

finder_win3 = BigramCollocationFinder.from_words(corpus, window_size=3)
finder_win3.apply_freq_filter(2)
finder_win3.apply_word_filter(word_filter)
print(finder_win3.nbest(bam.raw_freq, 20))

tam = TrigramAssocMeasures
finder_tri = TrigramCollocationFinder.from_words(corpus)
finder_tri.apply_freq_filter(2)
finder_tri.apply_word_filter(word_filter)
print(finder_tri.nbest(tam.raw_freq, 20))

#2
# Pointwise mutual information
print(finder.nbest(bam.pmi, 20))
# Log-likelihood ratio
print(finder.nbest(bam.likelihood_ratio, 20))
# Mutual information likelihood, an MI variant
print(finder.nbest(bam.mi_like, 20))
# Chi-squared test
print(finder.nbest(bam.chi_sq, 20))
# Student's t-test, w/independence hypothesis for unigrams
print(finder.nbest(bam.student_t, 20))

#3
tagged_corpus = brown.tagged_words(tagset='universal')  # ok to use a subset, e.g. 'ca01', for testing
finder_tagged = BigramCollocationFinder.from_words(tagged_corpus)
print(finder_tagged.nbest(bam.raw_freq, 20))

finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus)
print(finder_tags.nbest(bam.raw_freq, 20))
</code>