====== Lab 3 ======

**Try to complete as many of the problems as you can. Hand in your code files with what you have done in MySchool before midnight today (3 September).**

The first and the last problems should be in a file named ''lab3.py'', the others in ''myscript.py'', ''mytokenize.py'' and ''rot13.py''. If you can't manage to complete a particular problem, please hand in your incomplete code; comment it out if it produces an error.

===== 1. String slicing and stepping =====

  monty = 'Monty Python'

We can specify a "step" size for the slice. The following returns every second character within the slice: ''monty[6:11:2]''. It also works in the reverse direction: ''monty[10:5:-2]''.

**Try these for yourself, then experiment with different step values.**

**What happens if you ask the interpreter to evaluate ''monty[::-1]''? Explain why this is a reasonable result.**

(Problems 4 and 5 in [[http://www.nltk.org/book/ch03.html|Chapter 3]])

===== 2. myscript.py: argv =====

  from sys import argv
  print(argv)

The list ''argv'' contains the name of the script plus any parameters used in the invocation.

  $ python myscript.py One TWO three
  ['myscript.py', 'One', 'TWO', 'three']

If the number of parameters is fixed beforehand, you can unpack them all into variables. Otherwise, use the index number to get a particular parameter.

  first_param = argv[1]                      # using the index
  script_name, first, second, third = argv   # unpacking the argv list into four variables

**Create a script named ''myscript.py'' that produces the following output when executed with the parameters indicated:**

  $ python myscript.py file1.txt file2.txt
  Number of parameters: 2
  Script name: myscript.py
  First parameter: file1.txt
  Second parameter: file2.txt

**NOTE: The Python installer for Windows does not seem to add Python to the path by default.
If you can't invoke Python in the Command Prompt (cmd), the simplest solution might be to run the Python installer again (choose "Change Python") and then make sure "Add python.exe to Path" is selected (the last option under "Customize Python").**

{{:public:t-malv-15-3:python-path-install.png?direct&200|}}

([[https://docs.python.org/3.3/using/cmdline.html#using-on-cmdline|Information on executing python scripts in Windows]])

===== 3. mytokenize.py: Read file =====

**Create a script named ''mytokenize.py'' that reads a file's contents, tokenizes them, removes stopwords and prints out the remaining tokens, one per line.**

  from sys import argv
  from nltk import word_tokenize
  from nltk.corpus import stopwords
  # Get file name from argv (see problem 2).
  # Open file for reading.
  # Read contents into a string.
  # Tokenize the string.
  # Remove stopwords (words in stopwords.words('english')).
  # Print out the tokens, one per line.

You should be able to invoke the script using ''python mytokenize.py test.txt''.

===== 4. rot13.py: Read and Write file =====

**Now create a script named ''rot13.py'' that reads the contents of one file line by line and alters each line with a simple algorithm before writing it to another file.**

  from sys import argv
  from codecs import encode
  # Get two file names from argv (see problem 2).
  # Open file1 for reading.
  # Open file2 for writing.
  # Loop; read one line from file1.
      line = encode(line, 'rot_13')
      # Write the line to file2.
  # (Close file2)
  # (Close file1)

You should be able to invoke the script using, for example, ''python rot13.py input.txt output.txt''. Copy some text into a test file to try it out.

The function used to alter the lines is ''encode(string, 'rot_13')'' from ''codecs''; see [[https://en.wikipedia.org/wiki/ROT13|ROT13 on Wikipedia]].

  >>> from codecs import encode
  >>> print(encode('nun', 'rot_13'))
  aha

===== 5. List comprehension =====

**Rewrite the following loop as a "list comprehension"**:

  >>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
  >>> result = []
  >>> for word in sent:
  ...     word_len = (word, len(word))
  ...     result.append(word_len)
  >>> result
  [('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]

(Problem 10 in [[http://www.nltk.org/book/ch03.html|Chapter 3]]).

List comprehensions let you describe how a list is built in a compact yet readable way. See some examples on [[http://www.secnetix.de/olli/Python/list_comprehensions.hawk|this page]] or [[https://www.google.com/?q=list+comprehension+python#safe=off&q=list+comprehension+python|google]] for others.

{{:public:t-malv-15-3:listcomprehensions.gif?nolink&610|}}

**If you feel this problem is easy, you should also try your hand at problems 31 and 41.**

===== Possible Solutions =====

  #1
  >>> monty[::-1] == 'nohtyP ytnoM'
  True
  # A step of -1 walks the string backwards; the omitted start and end
  # then default to the last and first characters, so the result is the
  # whole string reversed.

  #2
  from sys import argv
  # print() already puts a space between its arguments, so the labels
  # must not end in a space if the output is to match the problem text.
  print('Number of parameters:', len(argv) - 1)
  print('Script name:', argv[0])
  print('First parameter:', argv[1])
  print('Second parameter:', argv[2])

  #3
  from sys import argv
  from nltk import word_tokenize
  from nltk.corpus import stopwords
  # Build the stopword set once instead of reloading the list for every token.
  stops = set(stopwords.words('english'))
  # Since files are context managers, they can be used in a with-statement.
  # The file will close when the code block is finished, even if an exception occurs.
  with open(argv[1]) as infile:
      for w in word_tokenize(infile.read()):
          if w.lower() not in stops:
              print(w)

  #4
  from sys import argv
  from codecs import encode
  with open(argv[1]) as infile, open(argv[2], 'w') as outfile:
      for line in infile:
          outfile.write(encode(line, 'rot_13'))

  #5
  [(w, len(w)) for w in sent]
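
The slices from problem 1 can be checked with a short sketch. The variable names ''forward'', ''backward'' and ''reversed_monty'' are only illustrative and are not part of the assignment.

```python
monty = 'Monty Python'

# Start at index 6 ('P'), stop before index 11, take every second character.
forward = monty[6:11:2]

# Start at index 10 ('o'), walk backwards and stop before index 5, two steps at a time.
backward = monty[10:5:-2]

# With start and end omitted, a step of -1 covers the whole string in reverse.
reversed_monty = monty[::-1]

print(forward, backward, reversed_monty)
```

With a negative step the start index has to be larger than the end index; otherwise the slice is empty.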
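
Solution #2 indexes ''argv'' directly, so it raises an ''IndexError'' when the script is started with fewer than two parameters. The following is one possible variant, not a required part of the lab, that checks the length before unpacking. The helper name ''describe'' and the usage message are assumptions made for this sketch.

```python
from sys import argv


def describe(args):
    """Return the report lines for an argv-style list (hypothetical helper)."""
    if len(args) != 3:
        # Assumed usage message; the lab text does not prescribe one.
        return ['Usage: python myscript.py FILE1 FILE2']
    # Unpacking is safe here because the length has just been checked.
    script_name, first, second = args
    return [
        'Number of parameters: {}'.format(len(args) - 1),
        'Script name: {}'.format(script_name),
        'First parameter: {}'.format(first),
        'Second parameter: {}'.format(second),
    ]


if __name__ == '__main__':
    print('\n'.join(describe(argv)))
```

Keeping the logic in a function that takes a list makes it possible to try it in the interpreter without starting the script from the command line.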
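
ROT13 shifts each letter by 13 places, so applying it twice gives back the original text. The sketch below uses this to check the approach of solution #4 on temporary files. The helper ''rot13_file'' and the file names are made up for illustration.

```python
import os
import tempfile
from codecs import encode


def rot13_file(in_path, out_path):
    """Copy in_path to out_path, ROT13-encoding each line (hypothetical helper)."""
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            outfile.write(encode(line, 'rot_13'))


# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, 'input.txt')
    coded = os.path.join(tmp, 'output.txt')
    decoded = os.path.join(tmp, 'roundtrip.txt')

    with open(plain, 'w') as f:
        f.write('Hello, World!\nnun\n')

    rot13_file(plain, coded)    # encode once
    rot13_file(coded, decoded)  # encoding again restores the input

    with open(coded) as f:
        coded_text = f.read()
    with open(decoded) as f:
        decoded_text = f.read()

print(coded_text)
```

Only the letters A-Z and a-z are changed; punctuation, spaces and line breaks pass through untouched.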