====== Lab 7 ======

**Try to complete as much as you can. Hand in a Python code file (''fullName_lab7.py'') with what you have finished in MySchool before midnight today (1 October).**

If you can't manage to complete a particular problem, please hand in your incomplete code; comment it out if it produces an error. **Answer questions or report results in comments in your code.**

[[http://www.nltk.org/book/ch07.html#chunking|Section 2 in chapter 7 contains information on Chunking]].

===== 1. Training a tagger to Chunk NPs =====

  import nltk.chunk, nltk.tag

We will be using the CoNLL 2000 corpus for training and test data, since it contains chunk IOB tags (I-inside, O-outside, B-begin) as well as POS tags.

  from nltk.corpus import conll2000
  test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
  train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

Use the ''UnigramChunker'' class from Example 3.1 in [[http://www.nltk.org/book/ch07.html#simple-evaluation-and-baselines|section 3.2 in Chapter 7]]:

  class UnigramChunker(nltk.ChunkParserI):
      def __init__(self, train_sents):
          # Train on (POS tag, chunk tag) pairs; the words themselves are dropped
          train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
          self.tagger = nltk.UnigramTagger(train_data)
      def parse(self, sentence):
          # sentence is a list of (word, POS tag) pairs
          pos_tags = [pos for (word, pos) in sentence]
          tagged_pos_tags = self.tagger.tag(pos_tags)
          chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
          conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                       in zip(sentence, chunktags)]
          return nltk.chunk.conlltags2tree(conlltags)

Train an NP chunker on the CoNLL 2000 corpus. Use the ''test_sents'' to evaluate our newly trained NP chunker.

  chunker = UnigramChunker(train_sents)
  print(chunker.evaluate(test_sents))

Try out the chunker on some sentences.
  sentence = nltk.pos_tag("While winter reigns the earth reposes but these colorless green ideas sleep furiously.".split())
  print(chunker.parse(sentence))

==== 1.1 Evaluate UnigramChunker ====

**TODO: Report the evaluation results from using the UnigramChunker on the train_sents.**

==== 1.2 Create MyChunker and Evaluate it ====

**TODO: Create a new class based on UnigramChunker that uses a trigram tagger with a unigram tagger as a backoff in ''__init__'' (see [[http://www.nltk.org/book/ch05.html#combining-taggers|5.4 Combining taggers]]). Evaluate your new chunker and report the results.**

==== 1.3 Get a feel for NP chunking ====

**TODO: Extract sentences from one of the many corpora available in NLTK and/or make up some sentences with complex NPs. Parse them with the chunker you created and examine the chunks.**

===== 2. A Regular Expression NP Chunker =====

  pattern = '''
  NP: {<DT>?<NN>}   #chunk rule: optional determiner followed by a noun
  '''
  re_chunker = nltk.RegexpParser(pattern)  # create a regexp chunk parser
  print(re_chunker.parse(sentence))

You can use the CoNLL ''test_sents'' to evaluate your hand-crafted NP chunker.

  print(re_chunker.evaluate(test_sents))

==== 2.1 Improve the Regexp Chunker ====

**TODO: Improve upon the regular expression NP chunking patterns.** You could use the tag patterns you discovered in 1.3 to guide you. [[http://www.nltk.org/book/ch07.html#chunking|Section 2 in chapter 7]] also contains some relevant examples.

  pattern = '''
  NP: {[...]}   #first NP chunk rule
      {[...]}   #second NP chunk rule
      {[...]}   #and so on ...
  '''

Note the special RE syntax for chunk rules:

  Example:          Matches:
  {<A>}             #Tag A
  {<A|B>}           #Tag A or tag B
  {<A><B>}          #Tag A followed by tag B
  {<A>?<B.?>}       #Optional tag A followed by tag Bx, where x is some optional character
  {<A><A><A>*}      #Two or more A tags in a row

You can use ''nltk.help.upenn_tagset()'' to look up the tags.

**NOTE: You should at least be able to get the IOB Accuracy for ''test_sents'' above 66%. Report your results.**
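
To make the IOB Accuracy figure more concrete, here is a minimal pure-Python sketch of how a per-token IOB comparison can be computed. It is an illustration only: the helper names ''spans_to_iob'' and ''iob_accuracy'' are hypothetical and not part of NLTK, the example sentence and its gold chunks are made up, and NLTK's own evaluation may differ in detail. The "predicted" spans are what a rule for an optional determiner followed by a noun would be expected to find when an adjective sits between the determiner and the noun.

```python
def spans_to_iob(n_tokens, np_spans):
    """Turn NP chunk spans (start, end), end exclusive, into IOB tags."""
    tags = ["O"] * n_tokens            # O: token is outside any chunk
    for start, end in np_spans:
        tags[start] = "B-NP"           # B: first token of a chunk
        for i in range(start + 1, end):
            tags[i] = "I-NP"           # I: token inside a chunk
    return tags


def iob_accuracy(gold, predicted):
    """Fraction of tokens whose predicted IOB tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("tag sequences must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up example sentence with POS tags: DT JJ NN VBD IN DT NN
tokens = ["the", "little", "dog", "barked", "at", "the", "cat"]

# Gold NPs: "the little dog" and "the cat"
gold = spans_to_iob(len(tokens), [(0, 3), (5, 7)])

# A determiner-plus-noun rule cannot skip the adjective,
# so it would chunk only "dog" and "the cat"
predicted = spans_to_iob(len(tokens), [(2, 3), (5, 7)])

accuracy = iob_accuracy(gold, predicted)
print(gold)
print(predicted)
print(round(accuracy, 3))
```

Here only 4 of the 7 tags agree, even though one of the two NPs was found exactly. This is why adding rules that cover adjectives and other modifiers tends to raise the score.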