public:t-malv-15-3:4

Differences

This shows you the differences between two versions of the page.

public:t-malv-15-3:4 [2015/09/10 22:34]
orvark [5. Working with Bigram scores]
public:t-malv-15-3:4 [2015/09/14 11:39] (current)
orvark
Line 95: Line 95:
  
 FYI: a normal Python dictionary raises a ''KeyError'' if you look up a key that is not currently in the dictionary. A ''defaultdict'', in contrast, creates an entry with a default value for any missing key you access. See [[http://www.nltk.org/book/ch05.html|section 3.4 in chapter 5 in the NLTK book]].
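The difference between the two dictionary types can be sketched as follows. This is a minimal illustration only; the keys and the ''int'' default factory are arbitrary choices for the example and are not taken from the exercise.

```python
from collections import defaultdict

# A plain dict raises KeyError when the key is missing.
plain = {}
try:
    plain["missing"]
except KeyError:
    raised = True
else:
    raised = False

# A defaultdict calls its factory (here int, which returns 0) for missing keys.
counts = defaultdict(int)
counts["the"] += 1                    # no KeyError: starts from 0, then adds 1
missing_value = counts["never seen"]  # merely reading the key creates the entry

print(raised, dict(counts), missing_value)
```

Note that reading a missing key from a ''defaultdict'' inserts it, so use ''key in counts'' when you only want to test for membership.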
 +
 +===== Solutions =====
 +
 +<code python>
 +import nltk
 +from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
 +from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
 +from nltk.corpus import brown, stopwords
 +
 +# 1: most frequent bigrams (adjacent and within a 3-word window) and trigrams
 +
 +bam = BigramAssocMeasures
 +
 +corpus = brown.words()  # OK to use a subset, e.g. brown.words('ca01'), for testing
 +
 +finder = BigramCollocationFinder.from_words(corpus)
 +
 +# Build the stopword set once; calling stopwords.words() for every token is very slow.
 +stop_words = set(stopwords.words('english'))
 +def word_filter(w):
 +    return len(w) < 3 or w.lower() in stop_words
 +
 +
 +finder.apply_freq_filter(2)
 +finder.apply_word_filter(word_filter)
 +
 +print(finder.nbest(bam.raw_freq, 20))
 +
 +
 +finder_win3 = BigramCollocationFinder.from_words(corpus, window_size=3)
 +finder_win3.apply_freq_filter(2)
 +finder_win3.apply_word_filter(word_filter)
 +print(finder_win3.nbest(bam.raw_freq, 20))
 +
 +
 +tam = TrigramAssocMeasures
 +
 +finder_tri = TrigramCollocationFinder.from_words(corpus)
 +finder_tri.apply_freq_filter(2)
 +finder_tri.apply_word_filter(word_filter)
 +print(finder_tri.nbest(tam.raw_freq, 20))
 +
 +# 2: rank the same bigrams with other association measures
 +
 +# Pointwise mutual information
 +print(finder.nbest(bam.pmi, 20))
 +# Log-likelihood ratio
 +print(finder.nbest(bam.likelihood_ratio, 20))
 +# Mutual-information-like score, a variant of MI
 +print(finder.nbest(bam.mi_like, 20))
 +# Chi-squared test
 +print(finder.nbest(bam.chi_sq, 20))
 +# Student's t-test, with an independence hypothesis for unigrams
 +print(finder.nbest(bam.student_t, 20))
 +
 +# 3: collocations over tagged words and over the tags alone
 +
 +tagged_corpus = brown.tagged_words(tagset='universal')  # OK to use a subset, e.g. fileids 'ca01', for testing
 +
 +# Bigrams whose items are (word, tag) pairs
 +finder_tagged = BigramCollocationFinder.from_words(tagged_corpus)
 +print(finder_tagged.nbest(bam.raw_freq, 20))
 +
 +# Bigrams over the tag sequence only (words discarded)
 +finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus)
 +print(finder_tags.nbest(bam.raw_freq, 20))
 +</code>
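To see what a measure such as ''bam.pmi'' rewards, it can help to compute pointwise mutual information by hand on a toy token list. The sketch below is plain Python and does not call NLTK; the helper ''bigram_pmi'' and the toy sentence are hypothetical names introduced for illustration. It assumes the usual definition, log2 of (pair count × total tokens) / (count of w1 × count of w2), which as far as I can tell matches what NLTK computes for adjacent bigrams, though that is worth checking against the NLTK source.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for adjacent word pairs in tokens."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    return {
        (w1, w2): math.log2(n_pair * total / (word_counts[w1] * word_counts[w2]))
        for (w1, w2), n_pair in pair_counts.items()
    }

toy_tokens = "new york is big and new york is busy".split()
scores = bigram_pmi(toy_tokens)

# Print pairs from highest to lowest PMI.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this toy data the pair seen only once between two rare words (''big and'') scores higher than the repeated pair ''new york'', which suggests why a frequency filter such as ''apply_freq_filter(2)'' is useful before ranking by PMI.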
/var/www/ailab/WWW/wiki/data/pages/public/t-malv-15-3/4.txt · Last modified: 2015/09/14 11:39 by orvark