Lab 7
- 1. Training a tagger to Chunk NPs
- 2. A Regular Expression NP Chunker
  - 2.1 Improve the Regexp Chunker

Try to complete as much as you can. Hand in a Python code file (fullName_lab7.py) containing what you have finished, via MySchool, before midnight today (1 October).

If you can't complete a particular problem, please hand in your incomplete code; comment it out if it produces an error.

Answer questions or report results in comments in your code.

Section 2 in chapter 7 contains information on Chunking.

1. Training a tagger to Chunk NPs

import nltk.chunk, nltk.tag

We will be using the CoNLL 2000 corpus for training and test data, since it contains chunk IOB tags (I = inside, O = outside, B = begin) as well as POS tags.
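
To make the IOB scheme concrete, here is a minimal sketch in plain Python. The function to_iob and the hand-picked NP spans are assumptions made up for illustration; they are not part of NLTK or of the CoNLL corpus reader.

```python
def to_iob(tagged_words, np_spans):
    """Label each token B-NP, I-NP or O, given NP spans as (start, end) with end exclusive."""
    labels = ["O"] * len(tagged_words)
    for start, end in np_spans:
        labels[start] = "B-NP"          # first token of the chunk
        for i in range(start + 1, end):
            labels[i] = "I-NP"          # remaining tokens inside the chunk
    return [(word, pos, label) for (word, pos), label in zip(tagged_words, labels)]


# Hand-tagged example sentence (hypothetical, not taken from the corpus).
sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
np_spans = [(0, 3), (5, 7)]             # "the little dog" and "the cat"

iob = to_iob(sentence, np_spans)
for row in iob:
    print(row)
```

The first token of each NP gets B-NP, the remaining tokens of that NP get I-NP, and every token outside an NP gets O.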

from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

Use the UnigramChunker class from Example 3.1 in Section 3.2 of Chapter 7:

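class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Convert each chunk tree to (POS tag, chunk tag) pairs; the words are dropped.
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        # Learn the most likely chunk tag for each POS tag.
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        # sentence is a list of (word, POS tag) pairs.
        pos_tags = [pos for (word,pos) in sentence]
        # "Tag" the POS tags with IOB chunk tags.
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        # Recombine words, POS tags and chunk tags, then build a chunk tree.
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)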
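
The idea behind the class can be sketched without NLTK: a unigram model remembers, for each POS tag, the chunk tag it was most often paired with in training. The helper train_unigram and the toy training pairs below are made up for illustration and are not NLTK's implementation.

```python
from collections import Counter, defaultdict


def train_unigram(pairs):
    """Return a dict mapping each POS tag to its most frequent chunk tag."""
    counts = defaultdict(Counter)
    for pos, chunk_tag in pairs:
        counts[pos][chunk_tag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


# Toy (POS tag, chunk tag) pairs; hypothetical data, not from CoNLL 2000.
training_pairs = [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"),
                  ("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"),
                  ("NN", "B-NP"), ("IN", "O")]

model = train_unigram(training_pairs)

# Predict chunk tags for a new POS sequence; unseen POS tags map to None.
predicted = [model.get(pos) for pos in ["DT", "JJ", "NN", "VBD"]]
print(predicted)
```

Because the model sees only one POS tag at a time, it always gives NN the same chunk tag, even in positions where a noun starts a new NP.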

Train an NP chunker on the CoNLL 2000 corpus. Use the test_sents to evaluate our newly trained NP chunker.

chunker = UnigramChunker(train_sents)
 
print(chunker.evaluate(test_sents))
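
Among the scores that evaluate reports is IOB accuracy, which is computed per token; precision, recall and F-measure are computed over whole chunks. A rough sketch of the per-token score, using a hypothetical helper iob_accuracy and made-up tag sequences:

```python
def iob_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted IOB tag equals the gold IOB tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Made-up example: the chunker cuts the first NP one token too early.
gold = ["B-NP", "I-NP", "I-NP", "O", "O", "B-NP", "I-NP"]
pred = ["B-NP", "I-NP", "O",    "O", "O", "B-NP", "I-NP"]

accuracy = iob_accuracy(gold, pred)
print(f"IOB accuracy: {accuracy:.1%}")
```

One wrong tag out of seven still gives about 86% here, although one of the two NPs now has the wrong boundary, so read IOB accuracy together with precision and recall.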

Try out the chunker on some sentences.

sentence = nltk.pos_tag("While winter reigns the earth reposes but these colorless green ideas sleep furiously.".split())
 
print(chunker.parse(sentence))

1.1 Evaluate UnigramChunker

TODO: Report the evaluation results from using the UnigramChunker on the train_sents.

1.2 Create MyChunker and Evaluate it

TODO: Create a new class based on UnigramChunker that uses a trigram tagger with a unigram tagger as a backoff in __init__ (see 5.4 Combining taggers). Evaluate your new chunker and report the results.
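
Backoff means asking the more specific model first and falling back to the more general one only when the specific model has no answer. The sketch below shows that control flow with two hand-written lookup tables. Everything in it (tag_with_backoff, the tables, keying the specific table on the previous POS tag) is a simplification for illustration; NLTK's n-gram taggers use the previously assigned tags as context and are trained from data.

```python
def tag_with_backoff(pos_tags, specific, general, default="O"):
    """Assign a chunk tag to each POS tag, preferring the context-aware table."""
    result = []
    for i, pos in enumerate(pos_tags):
        previous = pos_tags[i - 1] if i > 0 else None
        key = (previous, pos)
        if key in specific:             # specific model knows this context
            result.append(specific[key])
        elif pos in general:            # back off to the context-free model
            result.append(general[pos])
        else:                           # last resort
            result.append(default)
    return result


# Hand-written toy tables (hypothetical, not learned from a corpus).
specific = {("DT", "NN"): "I-NP", ("VBD", "NN"): "B-NP"}
general = {"DT": "B-NP", "NN": "I-NP", "VBD": "O"}

tags = tag_with_backoff(["DT", "NN", "VBD", "NN", "IN"], specific, general)
print(tags)
```

The noun after the verb is tagged B-NP by the specific table, whereas the general table alone would have labelled it I-NP.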

1.3 Get a feel for NP chunking

TODO: Extract sentences from one of the many corpora available in NLTK and/or make up some sentences with complex NPs. Parse them with the chunker you created and examine the chunks.

2. A Regular Expression NP Chunker

pattern = '''
    NP: {<DT>?<NN>} #chunk rule: optional determiner followed by a noun
    '''
 
re_chunker = nltk.RegexpParser(pattern) # create a re chunk parser
 
print(re_chunker.parse(sentence))

You can use the CoNLL test_sents to evaluate your hand-crafted NP chunker.

print(re_chunker.evaluate(test_sents))

2.1 Improve the Regexp Chunker

TODO: Improve the regular expression NP chunking pattern. You could use the tag patterns you discovered in 1.3 to guide you. Section 2 in chapter 7 also contains some relevant examples.

pattern = '''
    NP: {[...]} #first NP chunk rule
        {[...]} #second NP chunk rule
        {[...]} #and so on ...
    '''

Note the special RE Syntax for Chunk Rules:

 Example:      Matches:

 {<A>}         #Tag A
 {<A|B>}       #Tag A or tag B
 {<A><B>}      #Tag A followed by tag B
 {<A>?<B.*>}   #Optional tag A followed by any tag that starts with B (e.g. B, Bx, Bxy)
 {<A><A><A>*}  #Two or more A tags in a row.
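
To check your reading of the table, the rules can be imitated with Python's re module by writing a tag sequence as a string such as "<DT><NN>". This is a simplified stand-in, not how you call NLTK: to_regex and matches are hypothetical helpers, they test only whether a whole tag sequence fits a pattern, and they do not handle tags containing regex-special characters such as PRP$.

```python
import re


def to_regex(tag_pattern):
    """Turn a tag pattern like '<DT>?<NN.*>' into a compiled Python regex."""
    # Wrap every <...> in a group so that ?, * and + apply to the whole tag.
    body = tag_pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # Inside a tag, '.' may match any character except the angle brackets.
    body = body.replace(".", "[^<>]")
    return re.compile(body)


def matches(tag_pattern, tags):
    """True if the entire tag sequence fits the tag pattern."""
    tag_string = "".join(f"<{t}>" for t in tags)
    return to_regex(tag_pattern).fullmatch(tag_string) is not None


print(matches("<DT|JJ>", ["JJ"]))                     # tag DT or tag JJ
print(matches("<DT>?<NN.*>", ["DT", "NN"]))           # optional DT, then NN...
print(matches("<DT>?<NN.*>", ["NNS"]))                # DT left out, NNS starts with NN
print(matches("<JJ><JJ><JJ>*", ["JJ"]))               # needs at least two JJ tags
```

The last call fails to match because the pattern requires two JJ tags before the optional repeats.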

You can use nltk.help.upenn_tagset() to look up the tags.

NOTE: You should at least be able to get the IOB Accuracy for test_sents above 66%. Report your results.
