Lab 5

Try to complete as many problems as you can. Hand in a Python code file with what you have finished in MySchool before midnight today (17 September).

If you can't manage to complete a particular problem, please hand in your incomplete code anyway – comment it out if it produces an error.

1. Down the Garden Path

Let's get started by trying out the POS tagger in NLTK. See if you can think of some ambiguous sentences that confuse the tagger – or google for some garden path sentences.

import nltk
# Tokenize the sentence, then tag each token with its part of speech.
text = nltk.word_tokenize("The horse raced past the barn fell.")
tagged_text = nltk.pos_tag(text)
print(tagged_text)

Use nltk.help.upenn_tagset() to look up the meaning of the tags.

nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('NN.*')
# fyi, there is also a lookup function for the tagset used for the Brown corpus:
nltk.help.brown_tagset('NN')

See if making minor changes to the wording of a sentence is enough for the tagger to tag it correctly. For example, try adding "that was" after "horse" in the example above.

2. Training and Testing Data, and Finding the Baseline

from nltk.corpus import brown

Get the tagged sentences from the Brown corpus and separate them into a training part (90%) and a testing part (10%). (Note: we will also use them in the two remaining problems.) You can limit the corpus to the news category while you are developing and testing your code, but use the whole corpus for the handin.
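One way to do the 90/10 split is to cut the list of sentences at a fixed index. The sketch below is a minimal illustration on made-up data; the helper name split_train_test and the toy sentences are assumptions for this example and not part of NLTK.

```python
def split_train_test(sents, train_fraction=0.9):
    """Split a sequence of sentences into (train, test) at a fixed cut point."""
    cut = int(len(sents) * train_fraction)
    return sents[:cut], sents[cut:]


# Toy stand-in for a tagged corpus: ten one-word "sentences".
toy_sents = [[("word%d" % i, "NN")] for i in range(10)]

train_sents, test_sents = split_train_test(toy_sents)
print(len(train_sents), "training sentences,", len(test_sents), "testing sentences")
```

With the real data you would pass brown.tagged_sents() (or brown.tagged_sents(categories='news') while developing) instead of toy_sents. Because the cut is positional, the testing part is simply the last tenth of the sentences; shuffling before splitting would be a different design choice.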

Train a UnigramTagger on the training sentences, using a DefaultTagger as backoff. Use 'NN' as the default tag.

Evaluate the tagger's performance on the testing sentences. How well does it do? Do you think this is a fair baseline to measure other taggers against?
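The baseline setup described above could be sketched roughly as follows. To stay runnable without downloading the corpus, it trains on a few made-up sentences (an assumption for illustration only). The helper tagging_accuracy is hypothetical and not part of NLTK; it spells out what the evaluation measures, namely the fraction of tokens that receive the gold tag.

```python
import nltk


def tagging_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


# Made-up sentences standing in for the Brown training and testing parts.
toy_train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
toy_test = [[("the", "AT"), ("bird", "NN"), ("sings", "VBZ")]]

default_tagger = nltk.DefaultTagger("NN")
unigram_tagger = nltk.UnigramTagger(toy_train, backoff=default_tagger)

print("default only:", tagging_accuracy(default_tagger, toy_test))
print("unigram + default:", tagging_accuracy(unigram_tagger, toy_test))
```

Words the unigram tagger has never seen ("bird" and "sings" here) fall through to the default tag, which is why the choice of default tag matters for the baseline.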

3. A Cascade of Taggers

Extend the tagger combination in 5.4 Combining Taggers by defining a TrigramTagger called t3, which backs off to t2 as suggested.

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(brown_train_sents, backoff=t0)
t2 = nltk.BigramTagger(brown_train_sents, backoff=t1)

Evaluate both t2 and t3 and report the difference. You can also evaluate t0 and t1 to see what each step brings to the overall result.
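As a rough sketch of how the full cascade and its evaluation might look, the example below builds t0 through t3 on a few made-up sentences rather than the Brown split (the toy data and the score helper are assumptions for illustration). Newer NLTK releases name the evaluation method accuracy(), while older ones name it evaluate(), so the helper tries both.

```python
import nltk


def score(tagger, gold_sents):
    # Newer NLTK releases call this method accuracy(); older ones call it evaluate().
    method = getattr(tagger, "accuracy", None) or tagger.evaluate
    return method(gold_sents)


# Made-up data: "race" is a noun after "the" but a verb after "to".
toy_train = [
    [("the", "AT"), ("race", "NN"), ("ended", "VBD")],
    [("the", "AT"), ("race", "NN"), ("started", "VBD")],
    [("they", "PPSS"), ("want", "VB"), ("to", "TO"), ("race", "VB")],
]
toy_test = [
    [("we", "PPSS"), ("like", "VB"), ("to", "TO"), ("race", "VB")],
    [("the", "AT"), ("race", "NN"), ("ended", "VBD")],
]

t0 = nltk.DefaultTagger("NN")
t1 = nltk.UnigramTagger(toy_train, backoff=t0)
t2 = nltk.BigramTagger(toy_train, backoff=t1)
t3 = nltk.TrigramTagger(toy_train, backoff=t2)

scores = {}
for name, tagger in [("t0", t0), ("t1", t1), ("t2", t2), ("t3", t3)]:
    scores[name] = score(tagger, toy_test)
    print(name, round(scores[name], 3))
```

On data this small the numbers say nothing about real performance; the point is only the wiring of the backoff chain and the per-tagger comparison. With the Brown split you would pass brown_train_sents to each tagger and score on your testing sentences.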

4. Most Frequent Tag of Hapax Legomena

Instead of using the most common tag overall for the Default Tagger, some say that hapax legomena are a better model for unknown words. That is, words that occur only once in the corpus are likely to be representative of the words that never occur in it, the unseen words.

See if you can write code to find the most common tags in the set of words that occur only once in the Brown corpus. You might prefer to use brown.tagged_words() here, and even brown.tagged_words(categories='news') during testing.
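One possible shape for this computation, shown on made-up data, is to count the words first and then count the tags of the words seen exactly once. The helper name hapax_tag_counts and the toy word list are assumptions for illustration. Words are compared case-sensitively here, which is a choice you may want to revisit.

```python
from collections import Counter


def hapax_tag_counts(tagged_words):
    """Count the tags of words that occur exactly once in tagged_words."""
    tagged_words = list(tagged_words)  # the sequence is traversed twice
    word_counts = Counter(word for word, _ in tagged_words)
    return Counter(tag for word, tag in tagged_words if word_counts[word] == 1)


# Made-up (word, tag) pairs standing in for brown.tagged_words().
toy_tagged_words = [
    ("the", "AT"), ("dog", "NN"), ("saw", "VBD"), ("the", "AT"), ("cat", "NN"),
    ("and", "CC"), ("the", "AT"), ("dog", "NN"), ("saw", "VBD"), ("a", "AT"),
    ("bird", "NN"),
]

hapax_tags = hapax_tag_counts(toy_tagged_words)
print(hapax_tags.most_common(20))
```

In the toy list only "cat", "and", "a" and "bird" occur once, so only their tags are counted. Passing brown.tagged_words(categories='news') instead of toy_tagged_words would give the counts for the news category.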

What is the most common tag among the hapax legomena?

Looking at the twenty most common tags, how do you think the difference between the overall model and the hapax legomena model will develop as the training corpus grows larger?