See if making minor changes to the wording of the sentences is enough for the tagger to tag them correctly. For example, adding ''that was'' after ''horse'' in the example above.
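To try this out, you can run NLTK's off-the-shelf tagger on both wordings. A minimal sketch, assuming the example above is the classic garden-path sentence ''The horse raced past the barn fell'' (the original example is not visible in this excerpt):

<code python>
import nltk

# the tagger model was renamed in newer NLTK releases, so try both names
for pkg in ("averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    try:
        nltk.download(pkg, quiet=True)
    except Exception:
        pass

# assumed garden-path example; .split() avoids the punkt tokenizer dependency
original = "The horse raced past the barn fell".split()
reworded = "The horse that was raced past the barn fell".split()

print(nltk.pos_tag(original))
print(nltk.pos_tag(reworded))
</code>

Compare the tag given to ''raced'' in the two outputs: with ''that was'' present, the past-participle reading should become easier for the tagger to find.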
  
===== 2. Training and Testing Data and Finding the Baseline =====
  
<code python>
Train the **Unigram Tagger** on the training sentences using the **Default Tagger** as backoff. Use 'NN' as the default tag.
  
Evaluate the tagger's performance on the testing sentences. How well does it do? Do you think this is a fair baseline to measure other taggers against?
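A minimal sketch of this baseline, assuming a 90/10 split of the Brown news category into training and testing sentences (the actual split used in the exercise's code block may differ):

<code python>
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

# assumed 90/10 split of the news category; adjust to match your own split
tagged = brown.tagged_sents(categories="news")
size = int(len(tagged) * 0.9)
train_sents, test_sents = tagged[:size], tagged[size:]

# the Default Tagger tags every word 'NN'; the Unigram Tagger backs off to it
t0 = nltk.DefaultTagger("NN")
t1 = nltk.UnigramTagger(train_sents, backoff=t0)

# .accuracy() replaced the deprecated .evaluate() in newer NLTK releases
score = t1.accuracy(test_sents) if hasattr(t1, "accuracy") else t1.evaluate(test_sents)
print(score)
</code>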
  
===== 3. A Cascade of Taggers =====
</code>
  
Evaluate both ''t2'' and ''t3'' and report the difference. You can also evaluate ''t0'' and ''t1'' to see what each step brings to the overall result.
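Assuming the cascade above is the standard NLTK one (''t0'' a Default Tagger, then Unigram, Bigram and Trigram Taggers each backing off to the previous level — an assumption, since the code block is not shown in this excerpt), evaluating every level could look like this:

<code python>
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

# assumed 90/10 split of the news category into training and testing data
tagged = brown.tagged_sents(categories="news")
size = int(len(tagged) * 0.9)
train_sents, test_sents = tagged[:size], tagged[size:]

# assumed cascade: each n-gram tagger backs off to the next simpler one
t0 = nltk.DefaultTagger("NN")
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

for name, tagger in [("t0", t0), ("t1", t1), ("t2", t2), ("t3", t3)]:
    # .accuracy() replaced the deprecated .evaluate() in newer NLTK releases
    score = tagger.accuracy(test_sents) if hasattr(tagger, "accuracy") else tagger.evaluate(test_sents)
    print(name, round(score, 4))
</code>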
  
===== 4. Most Frequent Tag of Hapax Legomena =====
Instead of using the most common tag overall for the Default Tagger, some say that [[https://en.wikipedia.org/wiki/Hapax_legomenon|hapax legomenon]] is a better model for unknown words. That is, words that occur only once in the corpus are likely to be representative of the words that never occur, the unseen words.
  
See if you can write code to find the most common tags in the set of words that occur only once in the Brown corpus. You might prefer to use ''brown.tagged_words()'' here, and even ''brown.tagged_words(categories='news')'' during testing.
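One possible sketch, using ''brown.tagged_words(categories='news')'' as suggested for faster testing (drop the ''categories'' argument to run on the whole corpus):

<code python>
import nltk
from collections import Counter
from nltk.corpus import brown

nltk.download("brown", quiet=True)

tagged_words = brown.tagged_words(categories="news")

# count how often each word form occurs in the corpus
word_freq = Counter(word for word, tag in tagged_words)

# collect the tags of words that occur exactly once (the hapax legomena)
hapax_tags = Counter(tag for word, tag in tagged_words if word_freq[word] == 1)

# print the most common tags among hapax legomena, most frequent first
for tag, count in hapax_tags.most_common(20):
    print(tag, count)
</code>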
  
What is the most common tag of //hapax legomena//?
Looking at the twenty most common tags, how do you think the difference between the overall model and the hapax legomenon model will develop as the training corpus grows larger?