Try to complete as many problems as you can. Hand in a Python code file (fullName_lab4.py) with what you have finished in MySchool before midnight today (10 September).
If you can't manage to complete a particular problem, please hand in your incomplete code – comment it out if it produces an error.
See the NLTK Collocations howto for help.
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords
TODO:
Find the most frequent bigrams in the Brown corpus. Filter out stopwords (stopwords.words('english')) and words that are two characters or shorter. Repeat this for both bigrams found within a window of three words and for trigrams.
TODO:
Try the different association measures: pmi, likelihood_ratio, mi_like and chi_sq. Use help(BigramAssocMeasures.pmi) etc. and google to try to understand the difference between them.

TODO:
The Spearman correlation coefficient (1 = the same ranking, -1 = the opposite) can be used to compare the different association measures for a corpus or text, e.g. compare pmi against the raw_freq of the corpus. It can also come in handy when trying to understand the difference between the available association measures, e.g. compare likelihood_ratio and mi_like etc.
finder = [...]
print('Correlation: %0.3f' % spearman_correlation(
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.pmi)),
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.raw_freq))))
TODO:
What do you get if you use the tagged Brown corpus (brown.tagged_words(tagset='universal')) instead? Try finding the most frequent bigrams both for the (word, tag) pairs and for the tags alone.
The following code snippet shows an example of how you can use score_ngrams to get the bigram scores and work with them directly. Feel free to play around with it or try to improve on the code. Note that this is not a problem to be solved, so there is no need to hand anything in.
import nltk.collocations
from nltk.corpus import brown
from collections import defaultdict

finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(
    nltk.collocations.BigramAssocMeasures.likelihood_ratio)

# create a defaultdict of lists
prev_word = defaultdict(list)

# group by first word in bigram
for key, scores in scored:
    prev_word[key[0]].append((key[1], scores))

# sort each list by highest association measure
for key in prev_word:
    prev_word[key].sort(key=lambda x: -x[1])

def what_comes_after(word, num):
    print(word.upper(), ': ', [prev_word[word][:num]])

# wait for it ...
what_comes_after('strong', 10)
what_comes_after('powerful', 10)
FYI: a normal Python dictionary throws a KeyError if you try to get an item with a key that is not currently in the dictionary. The defaultdict, in contrast, will simply create any item that you try to access if it doesn't exist. See section 3.4 in chapter 5 of the NLTK book.
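The difference is easy to see in isolation (a toy example, not part of the lab):

```python
from collections import defaultdict

plain = {}
grouped = defaultdict(list)  # missing keys default to a fresh empty list

grouped['strong'].append(('tea', 4.2))  # key created on first access
print(grouped['strong'])                # [('tea', 4.2)]
print(grouped['never_seen'])            # [] -- created empty, no KeyError

try:
    plain['never_seen']                 # a normal dict has no default
except KeyError:
    print('plain dict raises KeyError')
```

Note that merely reading grouped['never_seen'] inserts the key, which is exactly why the snippet above can append to prev_word[key[0]] without checking whether the key exists first.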
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords

#1
bam = BigramAssocMeasures
corpus = brown.words()  # ok to use a subset e.g. 'ca01' for testing
finder = BigramCollocationFinder.from_words(corpus)
word_filter = lambda w: len(w) < 3 or w.lower() in stopwords.words('english')
#def word_filter(w): return len(w) < 3 or w.lower() in stopwords.words('english')
finder.apply_freq_filter(2)
finder.apply_word_filter(word_filter)
print(finder.nbest(bam.raw_freq, 20))

finder_win3 = BigramCollocationFinder.from_words(corpus, window_size=3)
finder_win3.apply_freq_filter(2)
finder_win3.apply_word_filter(word_filter)
print(finder_win3.nbest(bam.raw_freq, 20))

tam = TrigramAssocMeasures
finder_tri = TrigramCollocationFinder.from_words(corpus)
finder_tri.apply_freq_filter(2)
finder_tri.apply_word_filter(word_filter)
print(finder_tri.nbest(tam.raw_freq, 20))

#2
# Pointwise mutual information
print(finder.nbest(bam.pmi, 20))
# Log-likelihood ratio
print(finder.nbest(bam.likelihood_ratio, 20))
# Mutual information likelihood, an MI variant
print(finder.nbest(bam.mi_like, 20))
# Chi-squared test
print(finder.nbest(bam.chi_sq, 20))
# Student's t-test, w/ independence hypothesis for unigrams
print(finder.nbest(bam.student_t, 20))

#3
tagged_corpus = brown.tagged_words(tagset='universal')  # ok to use a subset e.g. 'ca01' for testing
finder_tagged = BigramCollocationFinder.from_words(tagged_corpus)
print(finder_tagged.nbest(bam.raw_freq, 20))
finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus)
print(finder_tags.nbest(bam.raw_freq, 20))