====== Lab 3 ======

**Try to complete as many of the problems as you can. Hand in your code files with what you have done in MySchool before midnight today (3 September). **

The first and the last problems should be in a file named ''lab3.py'', the others in ''myscript.py'', ''mytokenize.py'' and ''rot13.py''.

If you can't manage to complete a particular problem please hand in your incomplete code -- comment it out if it produces an error.

===== 1. String slicing and stepping =====

<code python>
monty = 'Monty Python'
</code>

We can specify a "step" size for the slice. The following returns every second character within the slice: ''monty[6:11:2]''. It also works in the reverse direction: ''monty[10:5:-2]'' **Try these for yourself, then experiment with different step values.**

**What happens if you ask the interpreter to evaluate ''monty[::-1]''? Explain why this is a reasonable result.**

(Problem 4 and 5 in [[http://www.nltk.org/book/ch03.html|Chapter 3]])

===== 2. myscript.py: argv =====

<code python>
from sys import argv

print(argv)
</code>

The list ''argv'' contains the name of the script plus any parameters used in the invocation.

<code>
$ python myscript.py One TWO three
['myscript.py', 'One', 'TWO', 'three']
</code>

If the number of parameters is fixed beforehand you can unpack them all into variables. Otherwise you use the index number to get a particular parameter.

<code python>
first_param = argv[1]    #using the index

script_name, first, second, third = argv     #unpacking the argv list into four variables
</code>

**Create a script named ''myscript.py'' that produces the following output when executed with the parameters indicated:**

<code>
$ python myscript.py file1.txt file2.txt
Number of parameters: 2
Script name: myscript.py
First parameter: file1.txt
Second parameter: file2.txt
</code>

**NOTE: The python installer for Windows does not seem to add python to the path by default. If you can't invoke python in the Command Prompt (cmd) the simplest solution might be to install python again (choose "Change Python") and then make sure "Add python.exe to Path" is selected (last option undir "Customize Python").**

{{:public:t-malv-15-3:python-path-install.png?direct&200|}}

([[https://docs.python.org/3.3/using/cmdline.html#using-on-cmdline|Information on executing python scripts in Windows]])

===== 3. mytokenize.py: Read file =====

**Create a script name ''mytokenize.py'' that reads a file contents, tokenizes them, removes stopwords and print out the remaining tokens, one per line.**

<code python>
from sys import argv
from nltk import word_tokenize
from nltk.corpus import stopwords

#Get file name from argv (see problem 3).
#Open file for reading.
#Read contents into a string.
#Tokenize the string.
#Remove stopwords (words in stopwords.words('english')).
#Print out the tokens, one per line.
</code>

You should be able to invoke the script using ''python mytokenize.py test.txt''.


===== 4. rot13.py: Read and Write file =====

**Now create a script named ''rot13.py'' that reads the contents from one file, line by line and alters the lines with a simple algorithm before writing them to another file.**

<code python>
from sys import argv
from codecs import encode

#Get two file names from argv (see problem 3).
#Open file1 for reading.
#Open file2 for writing.
#  Loop; read one line from file1.
     line = encode(line, 'rot_13')
#    Write the line to file2.
#(Close file2)
#(Close file1)
</code>

You should be able to invoke the script using ''python rot13.py input.txt output.txt'' for example. Copy some text and put into a test file.

The function used to alter the lines is ''encode(string, 'rot_13')'' from ''codecs'', see [[https://en.wikipedia.org/wiki/ROT13|ROT13 on Wikipedia]]

<code python>
>>> from codecs import encode
>>> print(encode('nun', 'rot_13'))
aha
</code>


===== 5. List comprehension ====

**Rewrite the following loop as a "list comprehension"**:

<code python>
>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> result = []
>>> for word in sent:
...     word_len = (word, len(word))
...     result.append(word_len)
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
</code>

(Problem 10 in [[http://www.nltk.org/book/ch03.html|Chapter 3]]).

List comprehensions enable the descriptive construction of lists in a very compact, yet easily readable way. See some examples on [[http://www.secnetix.de/olli/Python/list_comprehensions.hawk|this page]] or [[https://www.google.com/?q=list+comprehension+python#safe=off&q=list+comprehension+python|google]] for others.

{{:public:t-malv-15-3:listcomprehensions.gif?nolink&610|}}

**If you feel this problem is easy you should also try your hand at problems 31 and 41.**

===== Possible Solutions =====

<code python>
#1
>>> monty[::-1] == 'nohtyP ytnoM'
True

#2
from sys import argv

print('Number of parameters: ', len(argv)-1)
print('Script name: ', argv[0])
print('First parameter: ', argv[1])
print('Second parameter: ', argv[2])

#3
from sys import argv
from nltk import word_tokenize
from nltk.corpus import stopwords

with open(argv[1]) as infile:
    for w in word_tokenize(infile.read()):
        if w.lower() not in stopwords.words('english'):
            print(w)

#Since files are context managers, they can be used in a with-statement.
#The file will close when the code block is finished, even if an exception occurs

#4
from sys import argv
from codecs import encode

with open(argv[1]) as infile, open(argv[2], 'w') as outfile:
    for line in infile:
        outfile.write(encode(line, 'rot_13'))

#5
[(w, len(w)) for w in sent]
</code>