Lab 7
Try to complete as much as you can. Hand in a Python code file (fullName_lab7.py) with what you have finished in MySchool before midnight today (1 October).
If you can't manage to complete a particular problem please hand in your incomplete code – comment it out if it produces an error.
Answer questions or report results in comments in your code.
1. Training a tagger to Chunk NPs
import nltk.chunk, nltk.tag
We will be using the CoNLL 2000 corpus for training and test data since it contains chunk IOB tags (I-inside, O-outside, B-begin) as well as pos tags.
from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
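To make the IOB scheme concrete, here is a small sketch that does not need the corpus. The sentence and its tags are made up for illustration; only nltk.chunk.conlltags2tree and nltk.chunk.tree2conlltags (the same helpers the UnigramChunker class relies on) come from NLTK.

```python
import nltk.chunk

# Hand-written (word, POS tag, IOB chunk tag) triples; illustrative data only.
conll_tags = [
    ("the", "DT", "B-NP"),
    ("little", "JJ", "I-NP"),
    ("dog", "NN", "I-NP"),
    ("barked", "VBD", "O"),
    ("at", "IN", "O"),
    ("the", "DT", "B-NP"),
    ("cat", "NN", "I-NP"),
]

# Build a chunk tree from the triples: each B-NP starts a new NP subtree,
# each I-NP extends the current one, and O words stay outside any chunk.
tree = nltk.chunk.conlltags2tree(conll_tags)
print(tree)

# Collect the words of every NP chunk in the tree.
np_chunks = [
    " ".join(word for word, pos in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(np_chunks)

# Converting the tree back gives the original triples again.
round_trip = nltk.chunk.tree2conlltags(tree)
print(round_trip == conll_tags)
```

The sentences returned by conll2000.chunked_sents are trees of the same shape, so the same two helpers can be used to inspect them.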
Use the UnigramChunker class from Example 3.1 in section 3.2 in Chapter 7
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train a unigram tagger that maps each POS tag to its most likely IOB chunk tag.
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        # sentence is a list of (word, pos) pairs; predict a chunk tag for each POS tag.
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
Train an NP chunker on the CoNLL 2000 corpus. Use test_sents to evaluate our newly trained NP chunker.
chunker = UnigramChunker(train_sents)  # ChunkTagger(train_sents)
print(chunker.evaluate(test_sents))
Try out the chunker on some sentences.
sentence = nltk.pos_tag("While winter reigns the earth reposes but these colorless green ideas sleep furiously.".split())
print(chunker.parse(sentence))
1.1 Evaluate UnigramChunker
TODO: Report the evaluation results from using the UnigramChunker on the train_sents.
1.2 Create MyChunker and Evaluate it
TODO: Create a new class based on UnigramChunker that uses a trigram tagger with a unigram tagger as a backoff in __init__ (see 5.4 Combining taggers). Evaluate your new chunker and report the results.
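As a reminder of how backoff works, here is a minimal sketch on ordinary POS tagging with a tiny made-up training set. It is not the chunker asked for above: it uses a bigram tagger rather than a trigram tagger, and the sentences are illustrative only. The backoff parameter itself is the standard one accepted by the NLTK n-gram taggers.

```python
import nltk

# Two tiny hand-tagged sentences; illustrative data, far too small for real use.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# Each tagger hands over to its backoff when it has no answer for a context.
t0 = nltk.DefaultTagger("NN")               # last resort: tag everything NN
t1 = nltk.UnigramTagger(train, backoff=t0)  # most likely tag per word
t2 = nltk.BigramTagger(train, backoff=t1)   # uses the previous tag as context

# "loudly" never occurs in the training data, so it falls through to t0.
tagged = t2.tag(["the", "cat", "barks", "loudly"])
print(tagged)
```

In the chunker the tagger's input tokens are POS tags and its output tags are IOB chunk tags, but the backoff chain is built in the same way.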
1.3 Get a feel for NP chunking
TODO: Make up some sentences with complex NPs (or extract sentences from one of the many corpora available in NLTK). Parse them with the chunker you created and examine the chunks.
2. A Regular Expression NP Chunker
pattern = '''
NP: {<DT>?<NN>}   # chunk rule: optional determiner followed by a noun
'''
re_chunker = nltk.RegexpParser(pattern)  # create a regexp chunk parser
result = re_chunker.parse(sentence)
print(result)
You can use the CoNLL test_sents to evaluate your hand-crafted NP chunker.
print(re_chunker.evaluate(test_sents))
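To get a feeling for what the reported scores mean, the sketch below scores the one-rule chunker against a single hand-written gold sentence using nltk.chunk.ChunkScore. The gold sentence is made up for illustration and is not taken from CoNLL 2000; the scores on the real test set should be read the same way.

```python
import nltk
from nltk.chunk import ChunkScore, conlltags2tree

# Hand-written gold-standard sentence (illustrative, not from CoNLL 2000).
gold = conlltags2tree([
    ("the", "DT", "B-NP"),
    ("dogs", "NNS", "I-NP"),
    ("chased", "VBD", "O"),
    ("a", "DT", "B-NP"),
    ("cat", "NN", "I-NP"),
])

# Chunk the same (word, pos) pairs with the one-rule grammar from this section.
parser = nltk.RegexpParser("NP: {<DT>?<NN>}")
guess = parser.parse(gold.leaves())
print(guess)

# Compare the guessed chunks with the gold chunks.
score = ChunkScore()
score.score(gold, guess)
iob_accuracy = score.accuracy()  # fraction of words with the correct IOB tag
precision = score.precision()    # fraction of guessed chunks that are correct
recall = score.recall()          # fraction of gold chunks that were found
print(iob_accuracy, precision, recall)
```

The rule misses "the dogs" because the tag NNS is not matched by <NN>, so two of the five words get the wrong IOB tag and only one of the two gold chunks is found.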
2.1 Improve the RE Chunker
TODO: Improve upon the re patterns. You could use the tag patterns you discovered in 1.3 to guide you.
pattern = '''
NP: {[...]}   # first NP chunk rule
    {[...]}   # second NP chunk rule
    {[...]}   # and so on ...
'''
Note the special RE Syntax for Chunk Rules:
Examples:        Matches:
{<A>}            # Tag A
{<A|B>}          # Tag A or tag B
{<A><B>}         # Tag A followed by tag B
{<A>?<B.*>}      # Optional tag A followed by tag Bx, where x is any (possibly empty) sequence of characters
{<A><A><A>*}     # Two or more A tags in a row
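The effect of the .* wildcard can be checked on a hand-tagged sentence. This is a small sketch with a made-up sentence; np_chunks is a hypothetical helper written for this example and is not part of NLTK.

```python
import nltk

# Hand-tagged sentence; illustrative data only.
tagged = [
    ("the", "DT"), ("dogs", "NNS"), ("like", "VBP"), ("Mary", "NNP"),
    ("and", "CC"), ("a", "DT"), ("ball", "NN"),
]

# <NN> matches only the tag NN; <NN.*> matches any tag starting with NN.
narrow = nltk.RegexpParser("NP: {<DT>?<NN>}")
wide = nltk.RegexpParser("NP: {<DT>?<NN.*>}")

def np_chunks(parser, sent):
    """Hypothetical helper: return the words of each NP chunk the parser finds."""
    tree = parser.parse(sent)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "NP"
    ]

narrow_chunks = np_chunks(narrow, tagged)
wide_chunks = np_chunks(wide, tagged)
print(narrow_chunks)
print(wide_chunks)
```

With <NN> only "a ball" is chunked, while <NN.*> also picks up the plural noun (NNS) and the proper noun (NNP).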
You can use nltk.help.upenn_tagset() to look up the tags.
NOTE: You should be able to get the IOB Accuracy above 66%. Report your results.