====== Lab 7 ======
**Try to complete as much as you can. Hand in a Python code file (''fullName_lab7.py'') with what you have finished in MySchool before midnight today (1 October).**
If you can't manage to complete a particular problem please hand in your incomplete code -- comment it out if it produces an error.
**Answer questions or report results in comments in your code.**
[[http://www.nltk.org/book/ch07.html#chunking|Section 2 in chapter 7 contains information on Chunking]].
===== 1. Training a tagger to Chunk NPs =====
<code python>
import nltk.chunk, nltk.tag
</code>
We will be using the CoNLL 2000 corpus for training and test data since it contains chunk IOB tags (I-inside, O-outside, B-begin) as well as pos tags.
<code python>
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
</code>
Use the ''UnigramChunker'' class from Example 3.1 in [[http://www.nltk.org/book/ch07.html#simple-evaluation-and-baselines|section 3.2 in Chapter 7]]
<code python>
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
</code>
Train an NP chunker on the CoNLL 2000 corpus and use ''test_sents'' to evaluate our newly trained NP chunker.
<code python>
chunker = UnigramChunker(train_sents)
print(chunker.evaluate(test_sents))
</code>
Try out the chunker on some sentences.
<code python>
sentence = nltk.pos_tag("While winter reigns the earth reposes but these colorless green ideas sleep furiously.".split())
print(chunker.parse(sentence))
</code>
==== 1.1 Evaluate UnigramChunker ====
**TODO: Report the evaluation results from using the UnigramChunker on the train_sents.**
==== 1.2 Create MyChunker and Evaluate it ====
**TODO: Create a new class based on UnigramChunker that uses a trigram tagger with a unigram tagger as a backoff in ''__init__'' (see [[http://www.nltk.org/book/ch05.html#combining-taggers|5.4 Combining taggers]]). Evaluate your new chunker and report the results.**
==== 1.3 Get a feel for NP chunking ====
**TODO: Extract sentences from one of the many corpora available in NLTK or/and make up some sentences with complex NPs. Parse them with the chunker you created and examine the chunks. **
===== 2. A Regular Expression NP Chunker =====
<code python>
pattern = '''
NP: {<DT>?<NN>}   # chunk rule: optional determiner followed by a noun
'''
re_chunker = nltk.RegexpParser(pattern)  # create a regexp chunk parser
print(re_chunker.parse(sentence))
</code>
You can use the CoNLL ''test_sents'' to evaluate your hand-crafted NP chunker.
<code python>
print(re_chunker.evaluate(test_sents))
</code>
==== 2.1 Improve the Regexp Chunker ====
**TODO: Improve upon the Regular Expressions NP chunking patterns.** You could use the tag patterns you discovered in 1.3 to guide you. [[http://www.nltk.org/book/ch07.html#chunking|Section 2 in chapter 7]] also contains some relevant examples.
<code python>
pattern = '''
NP: {[...]}   # first NP chunk rule
    {[...]}   # second NP chunk rule
    {[...]}   # and so on ...
'''
</code>
Note the special RE syntax for chunk rules (angle brackets delimit a single tag):

<code>
Example:        Matches:
{<A>}           # Tag A
{<A|B>}         # Tag A or tag B
{<A><B>}        # Tag A followed by tag B
{<A>?<Bx?>}     # Optional tag A followed by tag Bx, where x is some optional character
{<A><A>+}       # Two or more A tags in a row
</code>
You can use ''nltk.help.upenn_tagset()'' to look up the tags.
**NOTE: You should at least be able to get the IOB Accuracy for ''test_sents'' above 66%. Report your results.**