====== Lab 7 ======
**Try to complete as much as you can. Hand in a Python code file (''fullName_lab7.py'') with what you have finished in MySchool before midnight today (1 October).**
If you can't manage to complete a particular problem please hand in your incomplete code -- comment it out if it produces an error.
**Answer questions or report results in comments in your code.**
[[http://www.nltk.org/book/ch07.html#chunking|Section 2 in chapter 7 contains information on Chunking]].
===== 1. Training a tagger to Chunk NPs =====
<code python>
import nltk.chunk, nltk.tag
</code>
We will be using the CoNLL 2000 corpus for training and test data since it contains chunk IOB tags (I-inside, O-outside, B-begin) as well as pos tags.
<code python>
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
</code>
Use the ''UnigramChunker'' class from Example 3.1 in [[http://www.nltk.org/book/ch07.html#simple-evaluation-and-baselines|section 3.2 in Chapter 7]]
<code python>
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
</code>
Train an NP chunker on the CoNLL 2000 corpus and use ''test_sents'' to evaluate our newly trained NP chunker.
<code python>
chunker = UnigramChunker(train_sents)
print(chunker.evaluate(test_sents))
</code>
Try out the chunker on some sentences.
<code python>
sentence = nltk.pos_tag("While winter reigns the earth reposes but these colorless green ideas sleep furiously.".split())
print(chunker.parse(sentence))
</code>
==== 1.1 Evaluate UnigramChunker ====
**TODO: Report the evaluation results from using the UnigramChunker on the train_sents.**
==== 1.2 Create MyChunker and Evaluate it ====
**TODO: Create a new class based on UnigramChunker that uses a trigram tagger with a unigram tagger as a backoff in ''__init__'' (see [[http://www.nltk.org/book/ch05.html#combining-taggers|5.4 Combining taggers]]). Evaluate your new chunker and report the results.**
==== 1.3 Get a feel for NP chunking ====
**TODO: Extract sentences from one of the many corpora available in NLTK or/and make up some sentences with complex NPs. Parse them with the chunker you created and examine the chunks. **
===== 2. A Regular Expression NP Chunker =====
<code python>
pattern = '''
NP: {<DT>?<NN>}   # chunk rule: optional determiner followed by a noun
'''
re_chunker = nltk.RegexpParser(pattern)  # create a regexp chunk parser
print(re_chunker.parse(sentence))
</code>
You can use the CoNLL ''test_sents'' to evaluate your hand-crafted NP chunker.
<code python>
print(re_chunker.evaluate(test_sents))
</code>
==== 2.1 Improve the Regexp Chunker ====
**TODO: Improve upon the Regular Expressions NP chunking patterns.** You could use the tag patterns you discovered in 1.3 to guide you. [[http://www.nltk.org/book/ch07.html#chunking|Section 2 in chapter 7]] also contains some relevant examples.
<code python>
pattern = '''
NP: {[...]}   # first NP chunk rule
    {[...]}   # second NP chunk rule
    {[...]}   # and so on ...
'''
</code>
Note the special RE syntax for chunk rules (angle brackets delimit a single tag):

<code>
Example:        Matches:
{<A>}           # Tag A
{<A|B>}         # Tag A or tag B
{<A><B>}        # Tag A followed by tag B
{<A>?<Bx?>}     # Optional tag A followed by tag Bx, where x is some optional character
{<A><A>+}       # Two or more A tags in a row
</code>
You can use ''nltk.help.upenn_tagset()'' to look up the tags.
**NOTE: You should at least be able to get the IOB Accuracy for ''test_sents'' above 66%. Report your results.**