====== Lab 4 ======
**Try to complete as many problems as you can. Hand in a Python code file (''fullName_lab4.py'') in MySchool before midnight today (10 September) with what you have finished.**
If you can't manage to complete a particular problem, please hand in your incomplete code -- comment it out if it produces an error.
See the [[http://www.nltk.org/howto/collocations.html|NLTK Collocations howto]] for help.
===== 1. Bi- and Trigram Collocation finders =====
<code python>
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords
</code>
TODO:
* Create a Bigram Collocation Finder for the Brown Corpus.
* Apply a filter to remove bigrams that occur fewer than two times.
* Apply a filter to remove bigrams that contain stopwords (''stopwords.words('english')'') and words that are two characters or shorter.
* Print out the 20 most **frequent bigrams**.
REPEAT THIS FOR BOTH:
* Bigrams using a window of size 3.
* Trigrams.
===== 2. Trying out different association measure functions =====
TODO:
* Try out some other measure functions, at least ''pmi'', ''likelihood_ratio'', ''mi_like'', ''chi_sq''.
* Print out the top 20 bigrams using each of the selected measure functions.
* You can use ''help(BigramAssocMeasures.pmi)'' etc. and google to try to understand the difference between them.
[[http://www.nltk.org/_modules/nltk/metrics/association.html]]
===== 3. Bigrams and tagged corpora =====
TODO:
* Create a Bigram Collocation Finder for the tagged version of the Brown Corpus.
* Print out the 20 most frequent word/tag bigrams.
* Create a Bigram Collocation Finder for the tagged version of the Brown Corpus, **using only the tags**.
* Print out the 20 most frequent tag/tag bigrams.
* Do you think any other association measures than raw frequency might be useful here?
===== 4. Correlations between association measure functions =====
The Spearman correlation coefficient (1 = identical rankings, -1 = opposite rankings) can be used to compare the different association measures for a corpus or text, e.g. compare ''pmi'' against the ''raw_freq'' of the corpus. It can also come in handy when trying to understand the difference between the available association measures, e.g. compare ''likelihood_ratio'' and ''mi_like'' etc.
<code python>
finder = [...]
print('Correlation: %0.3f' % spearman_correlation(
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.pmi)),
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.raw_freq))))
</code>
TODO:
* Compare different association measures and see which are similar and which are farthest from the raw frequency.
===== 5. Working with Bigram scores =====
The following code snippet shows an example of how you can use ''score_ngrams'' to get the bigram scores to work with them directly. Feel free to play around with it or try to improve on the code. **Note that this is not a problem to be solved so there is no need to hand anything in.**
<code python>
import nltk.collocations
from nltk.corpus import brown
from collections import defaultdict

finder = nltk.collocations.BigramCollocationFinder.from_words(
    brown.words())
scored = finder.score_ngrams(
    nltk.collocations.BigramAssocMeasures.likelihood_ratio)

# create a defaultdict of lists
prev_word = defaultdict(list)

# group by first word in bigram
for key, scores in scored:
    prev_word[key[0]].append((key[1], scores))

# sort each list by highest association measure
for key in prev_word:
    prev_word[key].sort(key=lambda x: -x[1])

def what_comes_after(word, num):
    print(word.upper(), ':', prev_word[word][:num])

# wait for it ...
what_comes_after('strong', 10)
what_comes_after('powerful', 10)
</code>
FYI: a normal Python dictionary throws a ''KeyError'' if you try to get an item with a key that is not currently in the dictionary. A ''defaultdict'', in contrast, simply creates a default value for any key you try to access that doesn't exist yet. See [[http://www.nltk.org/book/ch05.html|section 3.4 in chapter 5 of the NLTK book]].
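The difference is easy to see in isolation (the key names and score values below are made up for illustration, not taken from the corpus):

```python
from collections import defaultdict

plain = {}
try:
    plain['strong']              # key not present: a normal dict raises KeyError
except KeyError:
    print('KeyError!')

prev_word = defaultdict(list)    # missing keys default to an empty list
prev_word['strong'].append(('enough', 7.5))  # no KeyError: the list is created on first access
print(prev_word['strong'])       # [('enough', 7.5)]
print(prev_word['powerful'])     # [] -- merely reading a missing key creates an empty entry
```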
===== Solutions =====
<code python>
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords

#1
bam = BigramAssocMeasures
corpus = brown.words()  # ok to use a subset e.g. 'ca01' for testing

finder = BigramCollocationFinder.from_words(corpus)

stopset = set(stopwords.words('english'))  # a set, so membership tests are fast
word_filter = lambda w: len(w) < 3 or w.lower() in stopset
#def word_filter(w): return len(w) < 3 or w.lower() in stopset

finder.apply_freq_filter(2)
finder.apply_word_filter(word_filter)
print(finder.nbest(bam.raw_freq, 20))

finder_win3 = BigramCollocationFinder.from_words(corpus, window_size=3)
finder_win3.apply_freq_filter(2)
finder_win3.apply_word_filter(word_filter)
print(finder_win3.nbest(bam.raw_freq, 20))

tam = TrigramAssocMeasures
finder_tri = TrigramCollocationFinder.from_words(corpus)
finder_tri.apply_freq_filter(2)
finder_tri.apply_word_filter(word_filter)
print(finder_tri.nbest(tam.raw_freq, 20))

#2
# Pointwise mutual information
print(finder.nbest(bam.pmi, 20))
# Log-likelihood ratio
print(finder.nbest(bam.likelihood_ratio, 20))
# Mutual information likelihood, an mi variant
print(finder.nbest(bam.mi_like, 20))
# Chi-squared test
print(finder.nbest(bam.chi_sq, 20))
# Student's t-test, w/ independence hypothesis for unigrams
print(finder.nbest(bam.student_t, 20))

#3
tagged_corpus = brown.tagged_words(tagset='universal')  # ok to use a subset e.g. 'ca01' for testing

finder_tagged = BigramCollocationFinder.from_words(tagged_corpus)
print(finder_tagged.nbest(bam.raw_freq, 20))

finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus)
print(finder_tags.nbest(bam.raw_freq, 20))
</code>
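For #4 the same pattern applies: score the finder's ngrams under two measures, convert the scores to rankings, and correlate. A sketch, run here on a made-up toy word list so it is self-contained; for the actual exercise, substitute the filtered finder built from ''brown.words()'' in #1.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, spearman_correlation, ranks_from_scores

bam = BigramAssocMeasures

# Toy stand-in for the Brown Corpus -- swap in the finder from #1 for real results.
words = "the quick brown fox jumps over the lazy dog the quick fox sleeps".split()
finder4 = BigramCollocationFinder.from_words(words)

# Rank all bigrams under each measure and correlate against the raw-frequency ranking.
for measure in (bam.pmi, bam.likelihood_ratio, bam.mi_like, bam.chi_sq):
    corr = spearman_correlation(
        ranks_from_scores(finder4.score_ngrams(measure)),
        ranks_from_scores(finder4.score_ngrams(bam.raw_freq)))
    print('%s vs raw_freq: %0.3f' % (measure.__name__, corr))
```

The measures whose correlation is closest to 1 rank bigrams most like raw frequency; those nearest -1 (or simply lowest) are farthest from it.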