public:t-malv-15-3:3
Differences
This shows you the differences between two versions of the page.
public:t-malv-15-3:3 [2015/09/03 09:18] – [3. tokenize.py: Read file] orvark | public:t-malv-15-3:3 [2024/04/29 13:33] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 31: | Line 31: | ||
< | < | ||
$ python myscript.py One TWO three | $ python myscript.py One TWO three | ||
- | ['lab3-1.py', ' | + | ['myscript.py', ' |
</ | </ | ||
Line 52: | Line 52: | ||
</ | </ | ||
- | **NOTE: The python installer for Windows does not seem to add python to the path by default. If you can't invoke python in the Command Prompt (cmd) the simples | + | **NOTE: The python installer for Windows does not seem to add python to the path by default. If you can't invoke python in the Command Prompt (cmd) the simplest |
{{: | {{: | ||
Line 60: | Line 60: | ||
===== 3. mytokenize.py: | ===== 3. mytokenize.py: | ||
- | **Create a script named '' | + | **Create a script named '' |
<code python> | <code python> | ||
Line 69: | Line 69: | ||
#Get file name from argv (see problem 2). | #Get file name from argv (see problem 2). |
#Open file for reading. | #Open file for reading. | ||
- | #Read contents into string. | + | #Read contents into a string. |
#Tokenize the string. | #Tokenize the string. | ||
#Remove stopwords (words in stopwords.words(' | #Remove stopwords (words in stopwords.words(' | ||
Line 75: | Line 75: | ||
</ | </ | ||
- | You should be able to invoke the script using '' | + | You should be able to invoke the script using '' |
Line 128: | Line 128: | ||
**If you feel this problem is easy you should also try your hand at problems 31 and 41.** | **If you feel this problem is easy you should also try your hand at problems 31 and 41.** | ||
+ | |||
+ | ===== Possible Solutions ===== | ||
+ | |||
+ | <code python> | ||
+ | #1 | ||
+ | >>> | ||
+ | True | ||
+ | |||
+ | #2 | ||
+ | from sys import argv | ||
+ | |||
+ | print(' | ||
+ | print(' | ||
+ | print(' | ||
+ | print(' | ||
+ | |||
+ | #3 | ||
+ | from sys import argv | ||
+ | from nltk import word_tokenize | ||
+ | from nltk.corpus import stopwords | ||
+ | |||
+ | with open(argv[1]) as infile: | ||
+ | for w in word_tokenize(infile.read()): | ||
+ | if w.lower() not in stopwords.words(' | ||
+ | print(w) | ||
+ | |||
+ | #Since file objects are context managers, they can be used in a with-statement. |
+ | #The file is closed when the with-block exits, even if an exception occurs. |
+ | |||
+ | #4 | ||
+ | from sys import argv | ||
+ | from codecs import encode | ||
+ | |||
+ | with open(argv[1]) as infile, open(argv[2], | ||
+ | for line in infile: | ||
+ | outfile.write(encode(line, | ||
+ | |||
+ | #5 | ||
+ | [(w, len(w)) for w in sent] | ||
+ | </ |
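
The comment in solution 3 says that a file opened in a with-statement is closed even if an exception occurs. The following is a minimal sketch of how one might check that claim, using only the standard library. The helper ''read_and_fail'', the ''handles'' list and the temporary file are hypothetical names introduced for this illustration; they are not part of the original solutions.

```python
import os
import tempfile

# Hypothetical setup: create a small throwaway file to read back.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("One TWO three\n")

handles = []  # keeps a reference to the file object so it can be inspected later


def read_and_fail(filename):
    """Open the file in a with-statement and raise inside the block."""
    with open(filename) as infile:
        handles.append(infile)
        infile.read()
        raise ValueError("simulated failure inside the with-block")


try:
    read_and_fail(path)
except ValueError:
    pass  # the exception is expected; we only care about the file state

# The with-statement has already closed the file, despite the exception.
closed_after_error = handles[0].closed
print(closed_after_error)

os.remove(path)  # clean up the temporary file
```

Without the with-statement, the same guarantee would need an explicit try/finally around the read that calls ''infile.close()''.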
/var/www/cadia.ru.is/wiki/data/attic/public/t-malv-15-3/3.1441271929.txt.gz · Last modified: 2024/04/29 13:32 (external edit)