====== Lab 4 ======

**Try to complete as many problems as you can. Hand in a Python code file (''fullName_lab4.py'') with what you have finished in MySchool before midnight today (10 September).** If you can't manage to complete a particular problem, please hand in your incomplete code -- comment it out if it produces an error.

See the [[http://www.nltk.org/howto/collocations.html|NLTK Collocations howto]] for help.

===== 1. Bi- and Trigram Collocation finders =====

<code python>
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords
</code>

TODO:
  * Create a Bigram Collocation Finder for the Brown Corpus.
  * Apply a filter to remove bigrams that occur fewer than two times.
  * Apply a filter to remove bigrams that contain stopwords (''stopwords.words('english')'') or words that are two characters or shorter.
  * Print out the 20 most **frequent bigrams**.

REPEAT THIS FOR BOTH:
  * Bigrams using a window of size 3.
  * Trigrams.

===== 2. Trying out different association measure functions =====

TODO:
  * Try out some other measure functions, at least ''pmi'', ''likelihood_ratio'', ''mi_like'' and ''chi_sq''.
  * Print out the top 20 bigrams using each of the selected measure functions.
  * You can use ''help(BigramAssocMeasures.pmi)'' etc. and Google to try to understand the differences between them: [[http://www.nltk.org/_modules/nltk/metrics/association.html]]

===== 3. Bigrams and tagged corpora =====

TODO:
  * Create a Bigram Collocation Finder for the tagged version of the Brown Corpus.
  * Print out the 20 most frequent word/tag bigrams.
  * Create a Bigram Collocation Finder for the tagged version of the Brown Corpus, **using only the tags**.
  * Print out the 20 most frequent tag/tag bigrams.
  * Do you think any association measures other than raw frequency might be useful here?

===== 4. Correlations between association measure functions =====

The Spearman correlation coefficient (1 = identical ranking, -1 = opposite ranking) can be used to compare the different association measures for a corpus or text, e.g. compare ''pmi'' against the ''raw_freq'' of the corpus. It can also come in handy when trying to understand the differences between the available association measures, e.g. compare ''likelihood_ratio'' and ''mi_like'' etc.

<code python>
finder = [...]
print('Correlation: %0.3f' % spearman_correlation(
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.pmi)),
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.raw_freq))))
</code>

TODO:
  * Compare different association measures and see which are similar and which are farthest from the raw frequency.

===== 5. Working with Bigram scores =====

The following code snippet shows an example of how you can use ''score_ngrams'' to get the bigram scores and work with them directly. Feel free to play around with it or try to improve on the code. **Note that this is not a problem to be solved, so there is no need to hand anything in.**

<code python>
import nltk.collocations
from nltk.corpus import brown
from collections import defaultdict

finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(
    nltk.collocations.BigramAssocMeasures.likelihood_ratio)

# create a defaultdict of lists
prev_word = defaultdict(list)

# group by first word in bigram
for key, scores in scored:
    prev_word[key[0]].append((key[1], scores))

# sort each list by highest association measure
for key in prev_word:
    prev_word[key].sort(key=lambda x: -x[1])

def what_comes_after(word, num):
    print(word.upper(), ':', prev_word[word][:num])

# wait for it ...
what_comes_after('strong', 10)
what_comes_after('powerful', 10)
</code>

FYI: a normal Python dictionary throws a ''KeyError'' if you try to get an item with a key that is not currently in the dictionary.
The ''defaultdict'', in contrast, will simply create a default value for any key you try to access that doesn't exist yet. See [[http://www.nltk.org/book/ch05.html|section 3.4 in chapter 5 of the NLTK book]].

===== Solutions =====

<code python>
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords

#1
bam = BigramAssocMeasures
corpus = brown.words()  # ok to use a subset, e.g. 'ca01', for testing

finder = BigramCollocationFinder.from_words(corpus)
word_filter = lambda w: len(w) < 3 or w.lower() in stopwords.words('english')
# def word_filter(w): return len(w) < 3 or w.lower() in stopwords.words('english')
finder.apply_freq_filter(2)
finder.apply_word_filter(word_filter)
print(finder.nbest(bam.raw_freq, 20))

finder_win3 = BigramCollocationFinder.from_words(corpus, window_size=3)
finder_win3.apply_freq_filter(2)
finder_win3.apply_word_filter(word_filter)
print(finder_win3.nbest(bam.raw_freq, 20))

tam = TrigramAssocMeasures
finder_tri = TrigramCollocationFinder.from_words(corpus)
finder_tri.apply_freq_filter(2)
finder_tri.apply_word_filter(word_filter)
print(finder_tri.nbest(tam.raw_freq, 20))

#2
# Pointwise mutual information
print(finder.nbest(bam.pmi, 20))
# Log-likelihood ratio
print(finder.nbest(bam.likelihood_ratio, 20))
# Mutual information likelihood, an MI variant
print(finder.nbest(bam.mi_like, 20))
# Chi-squared test
print(finder.nbest(bam.chi_sq, 20))
# Student's t-test, w/independence hypothesis for unigrams
print(finder.nbest(bam.student_t, 20))

#3
tagged_corpus = brown.tagged_words(tagset='universal')  # ok to use a subset, e.g. 'ca01', for testing
finder_tagged = BigramCollocationFinder.from_words(tagged_corpus)
print(finder_tagged.nbest(bam.raw_freq, 20))

finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus)
print(finder_tags.nbest(bam.raw_freq, 20))
</code>