public:t-malv-15-3:3
                Differences
This shows you the differences between two versions of the page.
| public:t-malv-15-3:3 [2015/09/03 09:18] – [3. tokenize.py: Read file] orvark | public:t-malv-15-3:3 [2024/04/29 13:33] (current) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 31: | Line 31: | ||
| < | < | ||
| $ python myscript.py One TWO three | $ python myscript.py One TWO three | ||
| - | ['lab3-1.py', ' | + | ['myscript.py', ' | 
| </ | </ | ||
| Line 52: | Line 52: | ||
| </ | </ | ||
| - | **NOTE: The python installer for Windows does not seem to add python to the path by default. If you can't invoke python in the Command Prompt (cmd) the simples | + | **NOTE: The python installer for Windows does not seem to add python to the path by default. If you can't invoke python in the Command Prompt (cmd) the simplest | 
| {{: | {{: | ||
| Line 60: | Line 60: | ||
| ===== 3. mytokenize.py: | ===== 3. mytokenize.py: | ||
| - | **Create a script named '' | + | **Create a script named '' | 
| <code python> | <code python> | ||
| Line 69: | Line 69: | ||
| #Get file name from argv (see problem 3). | #Get file name from argv (see problem 3). | ||
| #Open file for reading. | #Open file for reading. | ||
| - | #Read contents into string. | + | #Read contents into a string. | 
| #Tokenize the string. | #Tokenize the string. | ||
| #Remove stopwords (words in stopwords.words(' | #Remove stopwords (words in stopwords.words(' | ||
| Line 75: | Line 75: | ||
| </ | </ | ||
| - | You should be able to invoke the script using '' | + | You should be able to invoke the script using '' | 
| Line 128: | Line 128: | ||
| **If you feel this problem is easy you should also try your hand at problems 31 and 41.** | **If you feel this problem is easy you should also try your hand at problems 31 and 41.** | ||
| + | |||
| + | ===== Possible Solutions ===== | ||
| + | |||
| + | <code python> | ||
| + | #1 | ||
| + | >>> | ||
| + | True | ||
| + | |||
| + | #2 | ||
| + | from sys import argv | ||
| + | |||
| + | print(' | ||
| + | print(' | ||
| + | print(' | ||
| + | print(' | ||
| + | |||
| + | #3 | ||
| + | from sys import argv | ||
| + | from nltk import word_tokenize | ||
| + | from nltk.corpus import stopwords | ||
| + | |||
| + | with open(argv[1]) as infile: | ||
| + | for w in word_tokenize(infile.read()): | ||
| + | if w.lower() not in stopwords.words(' | ||
| + | print(w) | ||
| + | |||
| + | #Since files are context managers, they can be used in a with-statement. | ||
| + | #The file will close when the code block is finished, even if an exception occurs. | ||
| + | |||
| + | #4 | ||
| + | from sys import argv | ||
| + | from codecs import encode | ||
| + | |||
| + | with open(argv[1]) as infile, open(argv[2], | ||
| + | for line in infile: | ||
| + | outfile.write(encode(line, | ||
| + | |||
| + | #5 | ||
| + | [(w, len(w)) for w in sent] | ||
| + | </ | ||
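
The steps listed for problem 3 (get the file name from argv, open the file, read it into a string, tokenize, remove stopwords) can be sketched without any third-party dependency. This is only an illustration under stated assumptions, not the course solution: the regular-expression tokenizer is a crude stand-in for `nltk.word_tokenize`, and `STOPWORDS`, `tokenize`, `remove_stopwords` and `tokens_from_file` are hypothetical names with a deliberately tiny word list in place of NLTK's stopwords corpus.

```python
import re
import sys

# ASSUMPTION: a tiny, hand-picked stopword set for illustration only.
# The exercise itself uses the NLTK stopwords corpus instead.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(text):
    """Split text into word tokens and single punctuation marks.

    A crude regex stand-in for nltk.word_tokenize, not a replacement for it.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


def tokens_from_file(path):
    """Read a whole file into a string, tokenize it, and drop stopwords."""
    # The with-statement closes the file even if reading raises an exception.
    with open(path, encoding="utf-8") as infile:
        text = infile.read()
    return remove_stopwords(tokenize(text))


# Runs only when invoked as a script with exactly one argument (the file name).
if __name__ == "__main__" and len(sys.argv) == 2:
    for token in tokens_from_file(sys.argv[1]):
        print(token)
```

Saved under a file name of your choice, the sketch takes the input file as its single command-line argument and prints one token per line. Building the stopword collection once as a set keeps each membership test cheap, which matters when the loop runs once per token.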
Last modified: 2024/04/29 13:32 (external edit)
                
                