
Lab 7

Try to complete as much as you can. Hand in a Python code file (fullName_lab7.py) with what you have finished in MySchool before midnight today (1 October).

If you can't manage to complete a particular problem, please hand in your incomplete code; comment it out if it produces an error.

Answer questions or report results in comments in your code.

1. Training a tagger to Chunk NPs

import nltk.chunk, nltk.tag

We will be using the CoNLL 2000 corpus for training and test data since it contains chunk IOB tags (I-inside, O-outside, B-begin) as well as pos tags.
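To make the IOB scheme concrete, here is a minimal plain-Python sketch (no NLTK needed) that groups (word, pos, iob) triples into NP chunks. The function name iob_to_chunks and the example sentence are made up for illustration. The handling of a stray I-NP tag is a simplifying assumption and may differ from what NLTK does.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into a list of NP chunks (lists of words)."""
    chunks = []
    current = None  # the chunk currently being built, or None when outside a chunk
    for word, pos, iob in tagged:
        if iob == "B-NP":
            # B marks the first token of a new chunk
            current = [word]
            chunks.append(current)
        elif iob == "I-NP" and current is not None:
            # I continues the chunk that is currently open
            current.append(word)
        else:
            # O (or a stray I-NP with no open chunk, ignored in this sketch)
            current = None
    return chunks


# Hand-tagged illustrative sentence, not taken from the corpus
example = [
    ("the", "DT", "B-NP"),
    ("little", "JJ", "I-NP"),
    ("dog", "NN", "I-NP"),
    ("barked", "VBD", "O"),
    ("at", "IN", "O"),
    ("the", "DT", "B-NP"),
    ("cat", "NN", "I-NP"),
]

print(iob_to_chunks(example))
# [['the', 'little', 'dog'], ['the', 'cat']]
```

The B tag is what lets two adjacent NPs be told apart: without it, a run of I-NP tags could not be split into separate chunks.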

from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

Use the UnigramChunker class from Example 3.1 in section 3.2 in Chapter 7 of the NLTK book.

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)
 
    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

Train an NP chunker on the CoNLL 2000 corpus. Use the test_sents to evaluate our newly trained NP chunker.

chunker = UnigramChunker(train_sents) #ChunkTagger(train_sents)
 
print(chunker.evaluate(test_sents))

Try out the chunker on some sentences.

sentence = nltk.pos_tag("While winter reigns the earth reposes but these colorless green ideas sleep furiously.".split())
 
print(chunker.parse(sentence))

1.1 Evaluate UnigramChunker

TODO: Report the evaluation results from using the UnigramChunker on the train_sents.

1.2 Create MyChunker and Evaluate it

TODO: Create a new class based on UnigramChunker that uses a trigram tagger with a unigram tagger as a backoff in __init__ (see 5.4 Combining taggers). Evaluate your new chunker and report the results.

1.3 Get a feel for NP chunking

TODO: Make up some sentences with complex NPs (or extract sentences from one of the many corpora available in NLTK). Parse them with the chunker you created and examine the chunks.

2. A Regular Expression NP Chunker

pattern = '''
    NP: {<DT>?<NN>} #chunk rule: optional determiner followed by a noun
    '''
 
re_chunker = nltk.RegexpParser(pattern) # create a re chunk parser
 
result = re_chunker.parse(sentence)
print(result)

You can use the CoNLL test_sents to evaluate your hand-crafted NP chunker.

print(re_chunker.evaluate(test_sents))

2.1 Improve the RE Chunker

TODO: Improve upon the re patterns. You could use the tag patterns you discovered in 1.3 to guide you.

pattern = '''
    NP: {[...]} #first NP chunk rule
        {[...]} #second NP chunk rule
        {[...]} #and so on ...
    '''

Note the special RE Syntax for Chunk Rules:

 Examples:     Matches:

 {<A>}         #Tag A
 {<A|B>}       #Tag A or tag B
 {<A><B>}      #Tag A followed by tag B
 {<A>?<B.*>}   #Optional tag A followed by any tag that starts with B (B, Bx, Bxy, ...)
 {<A><A><A>*}  #Two or more A tags in a row.
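One way to get a feel for these tag patterns is to see them as ordinary regular expressions applied to a string of tags such as <DT><JJ><NN>. The sketch below is a simplified approximation of that idea in plain Python, not NLTK's actual implementation. The helper names tag_pattern_to_regex and find_chunks are hypothetical, and the pattern is written without the surrounding curly braces.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<DT>?<NN.*>' into an ordinary regex.

    Simplified sketch: each <...> is wrapped in a group so that ?, * and +
    apply to the whole tag, and '.' is restricted so it cannot match
    across a tag boundary.
    """
    pattern = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    pattern = pattern.replace(".", "[^<>]")
    return re.compile(pattern)


def find_chunks(tags, tag_pattern):
    """Return the tag sequences that the pattern would chunk."""
    tag_string = "".join("<%s>" % tag for tag in tags)
    regex = tag_pattern_to_regex(tag_pattern)
    return [match.group(0) for match in regex.finditer(tag_string)]


tags = ["DT", "JJ", "NN", "VBD", "DT", "NNS", "NN"]

print(find_chunks(tags, "<DT>?<NN.*>"))
# ['<NN>', '<DT><NNS>', '<NN>']

print(find_chunks(["A", "A", "A", "B", "A"], "<A><A><A>*"))
# ['<A><A><A>']
```

In the first result the leading DT is left out of the first chunk because the JJ between DT and NN is not covered by the pattern. Gaps like this are what you would look for when improving your rules.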

You can use nltk.help.upenn_tagset() to look up the tags.

NOTE: You should be able to get the IOB Accuracy above 66%. Report your results.
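As a rough guide to what the IOB Accuracy figure measures, the sketch below computes the fraction of tokens whose predicted IOB tag equals the gold tag. It assumes this per-token definition and uses made-up tag sequences. The function name iob_accuracy is hypothetical and not part of NLTK.

```python
def iob_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted IOB tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        return 0.0
    correct = sum(1 for gold, pred in zip(gold_tags, predicted_tags) if gold == pred)
    return correct / len(gold_tags)


# Made-up example: the chunker misses the start of the second NP
gold = ["B-NP", "I-NP", "I-NP", "O", "O", "B-NP", "I-NP", "O"]
pred = ["B-NP", "I-NP", "I-NP", "O", "O", "O", "B-NP", "O"]

print(iob_accuracy(gold, pred))
# 0.75
```

Under this definition every correctly tagged O token counts as a hit, so a chunker can score reasonably on IOB accuracy while still missing many chunks. That is why the precision, recall and F-measure in the evaluation output are worth reporting too.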
