--- public:t-malv-15-3:4 [2015/09/10 09:14] – [5. Working with Bigram scores] orvark
+++ public:t-malv-15-3:4 [2024/04/29 13:33] (current) – external edit 127.0.0.1
@@ Line 19: / Line 19: @@
   * Create a Bigram Collocation Finder for the Brown Corpus.
   * Apply a filter to remove bigrams that occur fewer than two times.
-  * Apply a filter to remove stopwords (''stopwords.words('english')'') and words that are two characters or shorter.
+  * Apply a filter to remove bigrams that contain stopwords (''stopwords.words('english')'') or words that are two characters or shorter.
   * Print out the 20 most **frequent bigrams**.
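As a rough illustration of what the two filters in the list above do, the following standard-library sketch counts adjacent word pairs and then drops pairs that are too rare or that contain a filtered word. The names `toy_words`, `toy_stopwords` and `is_unwanted` are made up for this example, and the code only mimics the behaviour described in the list; it is not NLTK's implementation.

```python
from collections import Counter

# Hypothetical toy data; the exercise itself uses the Brown Corpus.
toy_words = "the old man saw the old dog and the old man ran to the old dog".split()
toy_stopwords = {"the", "and", "to"}  # stand-in for stopwords.words('english')

def is_unwanted(word):
    """True for stopwords and for words of two characters or fewer."""
    return len(word) < 3 or word.lower() in toy_stopwords

# Count every pair of adjacent words.
bigram_counts = Counter(zip(toy_words, toy_words[1:]))

# Keep bigrams that occur at least twice and contain no unwanted word.
kept = {
    bigram: count
    for bigram, count in bigram_counts.items()
    if count >= 2 and not any(is_unwanted(w) for w in bigram)
}

for bigram, count in sorted(kept.items(), key=lambda item: -item[1]):
    print(bigram, count)
```

Note that ('the', 'old') is the most frequent pair in the toy text but is dropped because one of its two words is a stopword; a bigram is removed as soon as either word matches the filter.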
@@ Line 62: / Line 62: @@
 ===== 5. Working with Bigram scores  =====
-The following code snippet shows an example of how you can use ''score_ngrams'' to get the bigram scores to work with them directly. Feel free to play around with it or try to improve on the code.
+The following code snippet shows how you can use ''score_ngrams'' to get the bigram scores and work with them directly. Feel free to play around with it or try to improve on the code. **Note that this is not a problem to be solved, so there is no need to hand anything in.**
 <code python>
@@ Line 74: / Line 74: @@
     nltk.collocations.BigramAssocMeasures.likelihood_ratio)
-#group by first word in bigram
+#create a defaultdict of lists
 prev_word = defaultdict(list)
+#group by first word in bigram
 for key, scores in scored:
    prev_word[key[0]].append((key[1], scores))
-#sort by strongest association
+#sort each list by highest association measure
 for key in prev_word:
    prev_word[key].sort(key = lambda x: -x[1])
@@ Line 92: / Line 94: @@
 </code>
-FYI: a normal Python dictionary throws a KeyError if you try to get an item with a key that is not currently in the dictionary. The ''defaultdict'' in contrast will simply create any items that you try to access if they dont exist.
+FYI: a normal Python dictionary raises a ''KeyError'' if you try to get an item with a key that is not currently in the dictionary. A ''defaultdict'', in contrast, will simply create any item that you try to access if it doesn't exist. See [[http://www.nltk.org/book/ch05.html|section 3.4 in chapter 5 in the NLTK book]].
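The difference described in the FYI note can be seen in a few lines of plain Python. This is only a small sketch; the words and scores are invented for illustration and do not come from the Brown Corpus.

```python
from collections import defaultdict

# A plain dict raises KeyError when a missing key is accessed.
plain = {}
raised_key_error = False
try:
    plain["new"].append(("york", 12.5))
except KeyError:
    raised_key_error = True
print("plain dict raised KeyError:", raised_key_error)

# defaultdict(list) creates an empty list on first access instead,
# so we can append without checking whether the key exists.
grouped = defaultdict(list)
grouped["new"].append(("york", 12.5))   # made-up score
grouped["new"].append(("jersey", 7.0))  # made-up score

# Caveat: merely reading a missing key also inserts it.
_ = grouped["never_added"]
print(dict(grouped))

# Use `in` or .get() to look something up without creating an entry.
print("absent" in grouped, grouped.get("absent"))
```

The caveat matters when you only want to query the dictionary: indexing with an unknown word quietly grows it, whereas `in` and `.get()` leave it unchanged.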
+===== Solutions  =====
+<code python>
+import nltk
+from nltk.collocations import *
+from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores
+from nltk.corpus import brown, stopwords
+#1
+bam = BigramAssocMeasures
+corpus = brown.words()  # ok to use a subset e.g. 'ca01' for testing
+finder = BigramCollocationFinder.from_words(corpus)
+stop_set = set(stopwords.words('english'))  # build once; calling stopwords.words() for every token is very slow
+word_filter = lambda w: len(w) < 3 or w.lower() in stop_set
+#def word_filter(w): return len(w) < 3 or w.lower() in stop_set
+finder.apply_freq_filter(2)
+finder.apply_word_filter(word_filter)
+print(finder.nbest(bam.raw_freq, 20))
+finder_win3 = BigramCollocationFinder.from_words(corpus, window_size=3)
+finder_win3.apply_freq_filter(2)
+finder_win3.apply_word_filter(word_filter)
+print(finder_win3.nbest(bam.raw_freq, 20))
+tam = TrigramAssocMeasures
+finder_tri = TrigramCollocationFinder.from_words(corpus)
+finder_tri.apply_freq_filter(2)
+finder_tri.apply_word_filter(word_filter)
+print(finder_tri.nbest(tam.raw_freq, 20))
+#2
+# Pointwise mutual information
+print(finder.nbest(bam.pmi, 20))
+# Log-likelihood ratio
+print(finder.nbest(bam.likelihood_ratio, 20))
+# Mutual information likelihood, a variant of mutual information
+print(finder.nbest(bam.mi_like, 20))
+# Chi squared test
+print(finder.nbest(bam.chi_sq, 20))
+# Student's t-test, w/independence hypothesis for unigrams
+print(finder.nbest(bam.student_t, 20))
+#3
+tagged_corpus = brown.tagged_words(tagset='universal')  # ok to use a subset e.g. 'ca01' for testing
+finder_tagged = BigramCollocationFinder.from_words(tagged_corpus)
+print(finder_tagged.nbest(bam.raw_freq, 20))
+finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus)
+print(finder_tags.nbest(bam.raw_freq, 20))
+</code>
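To make the association measures in part 2 of the solutions less of a black box, here is a minimal sketch of pointwise mutual information using the textbook definition PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))), with probabilities estimated from raw counts. The toy sentence and the helper name `pmi` are assumptions for illustration only; NLTK's `bam.pmi` may differ in details such as how the totals are counted.

```python
import math
from collections import Counter

# Hypothetical toy text; the exercise itself uses the Brown Corpus.
toy_words = "new york is big and new york is old but york is not new jersey".split()

unigram_counts = Counter(toy_words)
bigram_counts = Counter(zip(toy_words, toy_words[1:]))
n_words = len(toy_words)
n_bigrams = n_words - 1

def pmi(w1, w2):
    """Textbook PMI in bits, estimated from relative frequencies."""
    p_joint = bigram_counts[(w1, w2)] / n_bigrams
    p_w1 = unigram_counts[w1] / n_words
    p_w2 = unigram_counts[w2] / n_words
    return math.log2(p_joint / (p_w1 * p_w2))

# Rank the observed bigrams from highest to lowest PMI.
ranked = sorted(bigram_counts, key=lambda bigram: -pmi(*bigram))
for bigram in ranked[:5]:
    print(bigram, round(pmi(*bigram), 3))
```

On such a small sample, pairs of words that each occur only once tend to rank highest, which is one reason a frequency filter is usually applied before ranking by PMI.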