Lab 2

Try to complete at least 3 of the 5 problems. Hand in your code on MySchool before midnight today (27 August): a single file, FullName_lab2.py, containing the code in the same order as the given problems. You can use File→New File in IDLE to create the file.

If you can't manage to complete a particular problem, please hand in your incomplete code anyway; comment it out if it produces an error.

1. Tokenization with regular expressions

The Gutenberg corpus contains text we can work with. Let's select Moby Dick for today's lab.

>>> import nltk
>>> print(nltk.corpus.gutenberg.fileids())
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

NLTK includes a couple of tokenization functions which we will look at next week (https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)).

tokens = nltk.wordpunct_tokenize(text)
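
For orientation, the raw file opens with a bracketed title line, so the start of the token list should look roughly like this:

>>> tokens[:8]
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']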

TODO: Use the findall() function from the Python regular expression package re and write a regular expression that replicates the functionality of the wordpunct_tokenize() function.

import re
my_tokens = re.findall(r"###ADD RE###", text)

If you match the functionality perfectly, the two token lists should be identical.

>>> len(tokens) - len(my_tokens)
0
>>> if my_tokens == tokens:
        print("Perfect solution!")
 
Perfect solution!

You can use the following code to find the first mismatch between the lists while you are honing your regular expression. It returns the index where the mismatch occurs and the corresponding words from both lists. Note that it will raise a StopIteration error when the lists are identical.

next((idx, my_item, item) for idx, (my_item, item) in enumerate(zip(my_tokens, tokens)) if my_item != item)

You could wrap this code up in a helper function that also prints out the words surrounding the mismatch in both lists (e.g. print(my_tokens[idx-5:idx+5])), as sketched below.
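
For example, a minimal sketch of such a helper (the name first_mismatch and the size of the context window are just illustrative choices):

def first_mismatch(my_tokens, tokens, context=5):
    # walk both lists in parallel and stop at the first disagreement
    for idx, (my_item, item) in enumerate(zip(my_tokens, tokens)):
        if my_item != item:
            print('Mismatch at index', idx, ':', my_item, '!=', item)
            print('mine:', my_tokens[max(idx - context, 0):idx + context])
            print('nltk:', tokens[max(idx - context, 0):idx + context])
            return idx
    print('No mismatch found')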

Is there anything you thought should have been handled differently in the tokenization process?

2. Regular Expressions to match words

m = re.match(r'((\d+)\.(\d+))', '123.4567')

The function re.match() returns a Match object (see https://docs.python.org/2/library/re.html#re.MatchObject).

It is possible to check the truth value of the object:

if m:
    print('Found a match!')
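
This works because re.match() returns None when there is no match:

>>> re.match(r'\d+', 'abc') is None
True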

The method .group(0) returns the entire match:

>>> m.group(0)
'123.4567'

If you use parentheses to capture more than one group or subgroup, they can be found in .group(1) and so on. You can use .groups() to see all the captured groups. It's even possible to name the groups and use the names instead of numbers, as shown below.

>>> m.groups()
('123.4567', '123', '4567')
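
For example, the same pattern with named groups (the names number, whole and fraction are just illustrative):

>>> m = re.match(r'(?P<number>(?P<whole>\d+)\.(?P<fraction>\d+))', '123.4567')
>>> m.group('whole')
'123'
>>> m.group('fraction')
'4567'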

Now let's use re.match() to examine the words in the token list from part 1.

We can, for example, use it to try to find all mentions of years in the text.

year_words = [w for w in tokens if re.match(r'[12]\d{3}', w)]

It's better to compile the regular expression if it's going to be used repeatedly, e.g. in a loop.

year_re = re.compile(r'[12]\d{3}')
year_words = [w for w in tokens if year_re.match(w)]
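
Keep in mind that match() only anchors at the start of the string, so a token like '1851st' would also be accepted. If you want the whole token to match, you can add a $ anchor to the pattern or use fullmatch() (available since Python 3.4):

>>> bool(year_re.match('1851st'))
True
>>> bool(year_re.fullmatch('1851st'))
False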

TODO: Examine the tokens list and try to find a couple (2+) of meaningful word groups that you can write a regular expression for. Then write the expressions.

3. Frequency Distributions and n-grams

Chapter one introduced FreqDist(). We can use it to create a frequency distribution for the tokens in Moby Dick.

fd = nltk.FreqDist(tokens)

We can then look up the frequencies of individual words. Find the most frequent word and print a list of the most common ones.

>>> fd['whale']
906
>>> fd.max()
','
>>> fd.most_common(10)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982)]

This functionality can be used for more than single words.

TODO: Define a function that creates a frequency distribution for bigrams from a list of tokens.

def bigrams_from_tokens(token_list):
    bigram_fd = nltk.FreqDist()

    # add code

    return bigram_fd

Note that ConditionalFreqDist is not suitable for this task.

Use the function you define to print out the twenty most frequent bigrams in Moby Dick. How often does the bigram “Moby Dick” occur?

Example output:

>>> md_bigrams = bigrams_from_tokens(tokens)
>>> md_bigrams.most_common(10)
[(', and', 2607), ('of the', 1847), ("' s", 1737), ('in the', 1120), (', the', 908), ('; and', 853), ('to the', 712), ('. But', 596), (', that', 584), ('. "', 557)]
>>> md_bigrams['Moby Dick']
83

TODO: Now alter the function to make an n-gram version for extra credit.

Note that for n > 2 you have to make special arrangements to ensure that the n-grams do not cross sentence boundaries. One solution would be to copy the token_list and add '.' padding (n-2 times) after punctuation; one possible sketch is included in the solutions section at the end.

def ngrams_from_tokens(token_list, n=2):
    ngram_fd = nltk.FreqDist()

    # copy and pad at both ends
    # add padding after punctuation

    # loop through the list and update ngram_fd

    return ngram_fd

4. Initial letter frequency for males vs. females

Define a conditional frequency distribution over the Names corpus that allows you to see which initial letters are more frequent for males vs. females (cf. 4.4 in chapter 2 http://www.nltk.org/book/ch02.html#fig-cfd-gender).

(Problem 8 in chapter 2: http://www.nltk.org/book/ch02.html)

5. find_language()

Define a function find_language() that takes a string as its argument, and returns a list of languages that have that string as a word. Use the udhr corpus and limit your searches to files in the Latin-1 encoding.

(Problem 25 in chapter 2: http://www.nltk.org/book/ch02.html)

Possible Solutions

#1
\w+|[^\w\s]+   # one or more word characters, or one or more characters that are neither word characters nor whitespace
 
#2
^[A-Z]+$       # uppercase words, one or more uppercase letters
^\d\D+$        # ordinal numbers (1st, 2nd, 3rd ...): a digit followed by one or more non-digits
 
#3
def bigrams_from_tokens(token_list):
    bigram_fd = nltk.FreqDist()

    bigram_fd['. ' + token_list[0]] = 1  # first bigram ('.' padding + first word)

    for i in range(1, len(token_list)):
        bigram_fd[token_list[i-1] + ' ' + token_list[i]] += 1

    return bigram_fd
 
#This problem can be solved in numerous other ways, e.g. by using the bigrams() function mentioned
#in chapter one to create a list of bigram tuples that are then fed into the FreqDist, or simply
#used as keys.
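
For the extra-credit n-gram version, here is one possible sketch following the padding idea from the problem text. The details are assumptions: '.', '!' and '?' are taken to be the sentence-ending tokens, and n-grams are stored as space-joined strings like the bigrams above.

#3 extra credit
def ngrams_from_tokens(token_list, n=2):
    ngram_fd = nltk.FreqDist()

    # pad with n-1 '.' tokens at the front and n-2 extra '.' tokens after sentence-ending ones
    padded = ['.'] * (n - 1)
    for tok in token_list:
        padded.append(tok)
        if tok in ('.', '!', '?'):
            padded.extend(['.'] * (n - 2))

    # slide a window of length n over the padded list and count each n-gram
    for i in range(len(padded) - n + 1):
        ngram_fd[' '.join(padded[i:i + n])] += 1

    return ngram_fd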
 
#4
names = nltk.corpus.names
 
cfd = nltk.ConditionalFreqDist()
for w in names.words('male.txt'):
    cfd['male'][w[0]] += 1
for w in names.words('female.txt'):
    cfd['female'][w[0]] += 1
 
#one liner version (creates a list of tuples (gender, initial) that the cfd is initialized with):
#cfd=nltk.ConditionalFreqDist((file[:-4],name[0])
#                             for file in names.fileids()
#                             for name in names.words(file))
 
#output the results
#cfd.tabulate()
#cfd.plot()
 
#5
 
def find_language(word):
    latin1_langs = [w for w in nltk.corpus.udhr.fileids() if w.endswith('-Latin1')]
    return [l[:-7] for l in latin1_langs if word in nltk.corpus.udhr.words(l)]
 
#>>> find_language('að')
#['Icelandic_Yslenska']