Lab 4
Try to complete as many problems as you can. Hand in a Python code file (fullName_lab4.py) with what you have finished in MySchool before midnight today (10 September).
If you can't manage to complete a particular problem, please hand in your incomplete code anyway; comment it out if it produces an error.
See the NLTK Collocations howto for help.
1. Bi- and Trigram Collocation finders
import nltk
from nltk.collocations import *
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
from nltk.corpus import brown, stopwords
TODO:
- Create a Bigram Collocation Finder for the Brown Corpus.
- Apply a filter to remove bigrams that occur less than two times.
- Apply a filter to remove stopwords (stopwords.words('english')) and words that are two characters or shorter.
- Print out the 20 most frequent bigrams.
REPEAT THIS FOR BOTH:
- Bigrams using a window of size 3.
- Trigrams.
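The steps above can be sketched as follows. The token list here is a toy stand-in so the snippet is self-contained; for the assignment, build the finder from brown.words() and add a stopword check to the word filter.

```python
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy token list standing in for brown.words().
tokens = ("the quick brown fox jumps over the lazy dog "
          "the quick brown fox sleeps").split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                      # drop bigrams seen fewer than 2 times
finder.apply_word_filter(lambda w: len(w) <= 2)  # drop short words; add stopwords here too
top = finder.nbest(BigramAssocMeasures.raw_freq, 20)
print(top)

# The same recipe works with a window of size 3 ...
windowed = BigramCollocationFinder.from_words(tokens, window_size=3)
# ... and for trigrams via TrigramCollocationFinder / TrigramAssocMeasures.
```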
2. Trying out different association measure functions
TODO:
- Try out some other measure functions, at least pmi, likelihood_ratio, mi_like and chi_sq.
- Print out the top 20 bigrams using each of the selected measure functions.
- You can use help(BigramAssocMeasures.pmi) etc. and search online to try to understand the differences between them.
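One way to run several measures in one loop is to look each function up by name on BigramAssocMeasures. The toy tokens below are a stand-in; substitute brown.words() for the real assignment.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy tokens standing in for brown.words().
tokens = "strong tea is strong and strong coffee is stronger".split()

finder = BigramCollocationFinder.from_words(tokens)
results = {}
for name in ('pmi', 'likelihood_ratio', 'mi_like', 'chi_sq'):
    measure = getattr(BigramAssocMeasures, name)   # look the function up by name
    results[name] = finder.nbest(measure, 20)
    print(name, results[name])
```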
3. Bigrams and tagged corpora
TODO:
- Create a Bigram Collocation Finder for the tagged version of the Brown Corpus.
- Print out the 20 most frequent word/tag bigrams.
- Create a Bigram Collocation Finder for the tagged version of the Brown Corpus, using only the tags.
- Print out the 20 most frequent tag/tag bigrams.
- Do you think any other association measures than raw frequency might be useful here?
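Both variants can be sketched like this. A tagged corpus yields (word, tag) pairs, which are hashable and can be fed to the finder directly; the toy list below stands in for brown.tagged_words().

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy tagged words standing in for brown.tagged_words().
tagged = [('the', 'AT'), ('old', 'JJ'), ('man', 'NN'),
          ('the', 'AT'), ('old', 'JJ'), ('dog', 'NN')]

# word/tag bigrams: feed the (word, tag) pairs to the finder directly
wt_finder = BigramCollocationFinder.from_words(tagged)
wt_top = wt_finder.nbest(BigramAssocMeasures.raw_freq, 20)
print(wt_top)

# tag/tag bigrams: keep only the tag of each pair
tag_finder = BigramCollocationFinder.from_words(tag for _, tag in tagged)
tag_top = tag_finder.nbest(BigramAssocMeasures.raw_freq, 20)
print(tag_top)
```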
4. Correlations between association measure functions
The Spearman correlation coefficient (1 = identical rankings, -1 = opposite rankings) can be used to compare the different association measures for a corpus or text, e.g. to compare pmi against the raw_freq of the corpus. It can also come in handy when trying to understand the differences between the available association measures, e.g. comparing likelihood_ratio and mi_like.
finder = [...]
print('Correlation: %0.3f' % spearman_correlation(
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.pmi)),
    ranks_from_scores(finder.score_ngrams(BigramAssocMeasures.raw_freq))))
TODO:
- Compare different association measures and see which are similar and which are farthest from the raw frequency.
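The comparison can be looped over several measures against a raw_freq baseline. The tokens here are a toy stand-in; use brown.words() for the actual comparison. Note that ranks_from_scores yields its pairs lazily, so the baseline is wrapped in dict() to make it reusable across iterations.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import (BigramAssocMeasures, ranks_from_scores,
                          spearman_correlation)

# Toy tokens standing in for brown.words().
tokens = ("the quick brown fox jumps over the lazy dog "
          "the quick brown fox sleeps").split()

finder = BigramCollocationFinder.from_words(tokens)
# dict() so the baseline ranking can be reused in every loop iteration
baseline = dict(ranks_from_scores(
    finder.score_ngrams(BigramAssocMeasures.raw_freq)))

for name in ('pmi', 'likelihood_ratio', 'mi_like', 'chi_sq'):
    measure = getattr(BigramAssocMeasures, name)
    ranks = ranks_from_scores(finder.score_ngrams(measure))
    print('%s vs raw_freq: %0.3f' % (name, spearman_correlation(baseline, ranks)))
```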
5. Working with Bigram scores
The following code snippet shows an example of how you can use score_ngrams to get the bigram scores and work with them directly. Feel free to play around with it or try to improve on the code.
import nltk.collocations
from nltk.corpus import brown
from collections import defaultdict

finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(
    nltk.collocations.BigramAssocMeasures.likelihood_ratio)

# group the scored bigrams by the first word in the bigram
prev_word = defaultdict(list)  # a defaultdict of lists
for key, scores in scored:
    prev_word[key[0]].append((key[1], scores))

# sort each list by highest association measure
for key in prev_word:
    prev_word[key].sort(key=lambda x: -x[1])

def what_comes_after(word, num):
    print(word.upper(), ': ', prev_word[word][:num])

# wait for it ...
what_comes_after('strong', 10)
what_comes_after('powerful', 10)
FYI: a normal Python dictionary throws a KeyError if you try to get an item with a key that is not currently in the dictionary. A defaultdict, in contrast, will create a default value (here an empty list) for any missing key you access.
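A minimal illustration of the difference (the key and score here are made up for the example):

```python
from collections import defaultdict

bigram_scores = defaultdict(list)             # missing keys get a fresh empty list
bigram_scores['strong'].append(('tea', 7.3))  # no KeyError: the key is created on access
print(bigram_scores['strong'])                # [('tea', 7.3)]

plain = {}
try:
    plain['strong']                           # a normal dict raises KeyError instead
except KeyError:
    print('KeyError from the plain dict')
```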