public:t-malv-15-3:4

Differences

This shows you the differences between two versions of the page.

public:t-malv-15-3:4 [2015/09/10 09:17] – [5. Working with Bigram scores] orvark
public:t-malv-15-3:4 [2024/04/29 13:33] (current) – external edit 127.0.0.1
Line 19: Line 19:
   * Create a Bigram Collocation Finder for the Brown Corpus.   * Create a Bigram Collocation Finder for the Brown Corpus.
   * Apply a filter to remove bigrams that occur less than two times.   * Apply a filter to remove bigrams that occur less than two times.
-  * Apply a filter to remove stopwords (''stopwords.words('english')'') and words that are two characters or shorter.+  * Apply a filter to remove bigrams that contain stopwords (''stopwords.words('english')'') and words that are two characters or shorter.
   * Print out the 20 most **frequent bigrams**.   * Print out the 20 most **frequent bigrams**.
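The steps in this list can be illustrated without NLTK. The following is only a rough sketch of the counting and filtering logic, not how ''BigramCollocationFinder'' is implemented; the token list, the ''toy_stopwords'' set and the ''drop_word'' helper are made-up stand-ins for ''brown.words()'' and ''stopwords.words('english')''.

```python
from collections import Counter

# Toy token list standing in for brown.words() (hypothetical data).
tokens = "the old man saw the old dog and the old man smiled".split()

# Tiny stand-in for stopwords.words('english') (hypothetical list).
toy_stopwords = {"the", "and"}


def drop_word(w):
    """Return True for words the exercise wants removed:
    stopwords and words of two characters or fewer."""
    return len(w) < 3 or w.lower() in toy_stopwords


# Count every pair of adjacent tokens.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Keep bigrams that occur at least twice and contain no filtered word.
kept = {
    bigram: count
    for bigram, count in bigram_counts.items()
    if count >= 2 and not any(drop_word(w) for w in bigram)
}

# The (up to) 20 most frequent remaining bigrams.
top = sorted(kept.items(), key=lambda item: item[1], reverse=True)[:20]
print(top)
```

Note that a bigram is dropped as soon as either of its words matches the filter, which is why the most frequent pair in the toy data (it starts with "the") does not survive.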
  
Line 62: Line 62:
 ===== 5. Working with Bigram scores  ===== ===== 5. Working with Bigram scores  =====
  
-The following code snippet shows an example of how you can use ''score_ngrams'' to get the bigram scores to work with them directly. Feel free to play around with it or try to improve on the code.+The following code snippet shows an example of how you can use ''score_ngrams'' to get the bigram scores to work with them directly. Feel free to play around with it or try to improve on the code. **Note that this is not a problem to be solved so there is no need to hand anything in.**
  
 <code python> <code python>
Line 73: Line 73:
 scored = finder.score_ngrams( scored = finder.score_ngrams(
     nltk.collocations.BigramAssocMeasures.likelihood_ratio)     nltk.collocations.BigramAssocMeasures.likelihood_ratio)
 +
 +#create a defaultdict of lists
 +prev_word = defaultdict(list)
  
 #group by first word in bigram                                        #group by first word in bigram                                       
-prev_word = defaultdict(list) #a defaultdict of lists 
 for key, scores in scored: for key, scores in scored:
    prev_word[key[0]].append((key[1], scores))    prev_word[key[0]].append((key[1], scores))
Line 92: Line 94:
 </code> </code>
  
-FYI: a normal Python dictionary throws a ''KeyError'' if you try to get an item with a key that is not currently in the dictionary. The ''defaultdict'' in contrast will simply create any items that you try to access if they don't exist.+FYI: a normal Python dictionary throws a ''KeyError'' if you try to get an item with a key that is not currently in the dictionary. The ''defaultdict'' in contrast will simply create any items that you try to access if they don't exist. See [[http://www.nltk.org/book/ch05.html|section 3.4 in chapter 5 in the NLTK book]]. 
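To make the difference concrete, here is a small sketch using only the standard library. The ''scored'' list is made-up data in the same ''(bigram, score)'' shape that ''score_ngrams'' returns, and the scores themselves are arbitrary.

```python
from collections import defaultdict

# Hypothetical (bigram, score) pairs; the numbers are arbitrary.
scored = [
    (("new", "york"), 9.5),
    (("new", "deal"), 4.0),
    (("los", "angeles"), 8.1),
]

# A plain dict raises KeyError when the key has not been set yet.
plain = {}
missing_key_raised = False
try:
    plain["new"].append(("york", 9.5))
except KeyError:
    missing_key_raised = True

# A defaultdict(list) creates an empty list on first access instead.
prev_word = defaultdict(list)
for (first, second), score in scored:
    prev_word[first].append((second, score))

print(missing_key_raised)
print(dict(prev_word))
```

One thing to keep in mind: merely reading ''prev_word[some_key]'' also inserts that key, so use ''some_key in prev_word'' or ''prev_word.get(some_key)'' when you only want to check for it.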
 + 
 +===== Solutions  ===== 
 + 
 +<code python> 
 +import nltk 
 +from nltk.collocations import * 
 +from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures, spearman_correlation, ranks_from_scores 
 +from nltk.corpus import brown, stopwords 
 + 
 +#1 
 + 
 +bam = BigramAssocMeasures 
 + 
 +corpus =  brown.words()  # ok to use a subset e.g. 'ca01' for testing 
 + 
 +finder = BigramCollocationFinder.from_words(corpus) 
 + 
 +word_filter = lambda w: len(w) < 3 or w.lower() in stopwords.words('english') 
 +#def word_filter(w): return len(w) < 3 or w.lower() in stopwords.words('english') 
 + 
 + 
 +finder.apply_freq_filter(2) 
 +finder.apply_word_filter(word_filter) 
 + 
 +print(finder.nbest(bam.raw_freq, 20)) 
 + 
 + 
 +finder_win3 = BigramCollocationFinder.from_words(corpus, window_size=3) 
 +finder_win3.apply_freq_filter(2) 
 +finder_win3.apply_word_filter(word_filter) 
 +print(finder_win3.nbest(bam.raw_freq, 20)) 
 + 
 + 
 +tam = TrigramAssocMeasures 
 + 
 +finder_tri = TrigramCollocationFinder.from_words(corpus) 
 +finder_tri.apply_freq_filter(2) 
 +finder_tri.apply_word_filter(word_filter) 
 +print(finder_tri.nbest(tam.raw_freq, 20)) 
 + 
 +#2 
 + 
 +# Pointwise mutual information 
 +print(finder.nbest(bam.pmi, 20)) 
 +# Log-likelihood ratio 
 +print(finder.nbest(bam.likelihood_ratio, 20)) 
 +# Mutual information likelihood, an MI variant 
 +print(finder.nbest(bam.mi_like, 20)) 
 +# Chi squared test 
 +print(finder.nbest(bam.chi_sq, 20))           
 +# Student's t-test, w/independence hypothesis for unigrams 
 +print(finder.nbest(bam.student_t, 20))           
 + 
 +#3 
 + 
 +tagged_corpus = brown.tagged_words(tagset='universal')  # ok to use a subset e.g. 'ca01' for testing 
 + 
 +finder_tagged = BigramCollocationFinder.from_words(tagged_corpus) 
 +print(finder_tagged.nbest(bam.raw_freq, 20)) 
 + 
 +finder_tags = BigramCollocationFinder.from_words(t for w, t in tagged_corpus) 
 +print(finder_tags.nbest(bam.raw_freq, 20)) 
 +</code>
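A possible refinement of the word filter in the solution: it looks up ''stopwords.words('english')'' for every token, which appears to rebuild the list on each call and then scans it linearly. Building a set once should be noticeably faster on the full Brown Corpus. The sketch below uses a hypothetical ''load_stopwords()'' stand-in so that it runs without NLTK data; in the real solution you would pass ''stopwords.words('english')'' to ''frozenset'' instead.

```python
def load_stopwords():
    """Hypothetical stand-in for stopwords.words('english')."""
    return ["the", "of", "and", "a", "in", "that", "was"]


# Build the lookup set once, outside the filter function.
STOPWORDS = frozenset(load_stopwords())


def word_filter(w):
    """Return True for words that should be filtered out:
    shorter than three characters, or a stopword (case-insensitive)."""
    return len(w) < 3 or w.lower() in STOPWORDS


print([w for w in ["The", "ok", "Brown", "corpus", "was"] if not word_filter(w)])
```

The function has the same behaviour as the lambda in the solution, so it should be usable as a drop-in argument to ''apply_word_filter''.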
/var/www/cadia.ru.is/wiki/data/attic/public/t-malv-15-3/4.1441876678.txt.gz · Last modified: 2024/04/29 13:32 (external edit)