====== T-538-MALV and T-725-MALV, Natural Language Processing Fall 2009 ======

===== Basic Info =====

  * **Instructors: ** [[http://www.ru.is/kennarar/hrafn|Hrafn Loftsson]] and [[http://www.ru.is/kennarar/hannes|Hannes Högni Vilhjálmsson]]
  * **Contact: ** Office at Kringlan 1, 599 6227, {hrafn, hannes}@
  * **Classes When: ** Mondays 8:15-9:50 and Thursdays 8:15-9:50
  * **Classes Where: ** Kringlan 1, Room K-6

===== Description =====

The goal of language technology (LT) is to develop systems which allow people to communicate with computers using natural languages. LT is an interdisciplinary field, requiring knowledge from subjects like linguistics, statistics, psychology, engineering and computer science. This course discusses fundamentals of natural language processing (NLP), which is one of the subfields of LT, and introduces research in the field. Students acquire understanding of the various stages of NLP, e.g. morphological analysis, part-of-speech tagging, syntactic analysis, semantic analysis, discourse and dialogue. In the course, students work on programming projects related to the aforementioned stages.

===== Goals =====

The course objectives are that students:
  * Know the main methods used in the field of natural language processing
  * Are familiar with the main research areas in the field
  * Are able to implement a system which processes a natural language

===== Coursework Overview =====

To provide a rich hands-on experience, students will build their own application that relies on NLP over the course of the semester. While the choice of final application is completely in the hands of the students, important NLP components will be built in a series of programming projects that lead up to the final system demonstration. Three homework assignments will also be distributed during the semester to reinforce some of the more theoretical material.
Everything that has to be turned in, including the programming projects, should arrive no later than 23:59 on the due date; late work incurs a 10% penalty for each additional day, including weekends and holidays. Projects are not accepted if handed in more than two days late.

General participation in class discussions counts towards a special participation grade. In addition, M.Sc. level students need to prepare and give one presentation on an existing research paper in the field, which also counts towards their participation grade.

===== Assignments and Projects =====

^Assignment^Code^Description^Assigned^Due^Duration^Weight^Data^
|Homework Assignment 1 |A1|{{:public:t-malv-09-3:assignmenti.pdf|Regular expressions}} |M 21. Sep |M 28. Sep | 1 week | 5% | |
|Programming Project 1 |P1|{{:public:t-malv-09-3:project_tokenisation.pdf|Tokenizing text}} |M 28. Sep |T 8. Oct | 10 days | 8% - 10% | {{:public:t-malv-09-3:data.zip|Data}}|
|Homework Assignment 2 |A2| {{:public:t-malv-09-3:assignmentii.pdf|Tagging}} |M 12. Oct |M 19. Oct | 1 week | 5% | |
|Programming Project 2 |P2| {{:public:t-malv-09-3:tagging.pdf|Tagging text}} |M 19. Oct |T 29. Oct | 10 days | 8% - 10% | [[http://www.ru.is/faculty/hrafn/Data/eng.zip]]|
|Programming Project 2 |P2| {{:public:t-malv-09-3:tagging_darmstadt.pdf|Tagging text - Darmstadt}} |F 27. Nov |F 11. Dec | 14 days | | [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/]]|
|Programming Project 3 |P3|Parsing Text: {{:public:t-malv-09-3:partialparsing_icelandic.pdf|Icelandic}} or {{:public:t-malv-09-3:partialparsing_english.pdf|English}} |T 29. Oct |M 09. Nov | 10 days | 8% | |
|Homework Assignment 3 |A3|{{:public:t-malv-09-3:assignment3.pdf|Discourse analysis}} |M 9. Nov |M 16. Nov | 1 week | 5% | |
|[[Programming Project 4]] |P4| {{:public:t-malv-09-3:final.pdf|Application}} |M 16. Nov |W 2. Dec | 16 days | 16% - 20% | |
^ ^^^^^^Total 55% ^ ^

===== Final Exam =====

There will be a final written exam counting 30% towards your grade.
===== Online Discussion Forum =====

The course has an online discussion forum that we can use in any way we see fit. Note that students have to register on the forum to post replies (simply go to the address below to register).

^Host^Forum Name^Location^
|ProBoards|NLP2009|[[http://ruclasses.proboards.com/index.cgi?board=nlp2009]]|
^ ^^^

===== Lectures =====

Lectures are recorded outside the classroom using the Camtasia program (for Windows). If you have problems listening to and/or viewing the resulting .AVI files, and you are running Windows or Mac, then you probably need the appropriate codec from [[http://www.techsmith.com/download/codecs.asp]]. If you are using Ubuntu Linux then you should install a codec with **sudo apt-get install ffmpeg** and then use mplayer to view the .AVI files.

^Week^Date^Topic^Material^Lecture^Who^
|1|thu 10/09| {{:public:t-malv-09-3:aboutthiscourse.pdf|About the course}} | | {{:public:t-malv-09-3:nlp-aboutcourse.avi|AVI}} | Hannes |
|1|thu 10/09| {{:public:t-malv-09-3:introduction.pdf|Introduction}} | Chapter 1 | {{:public:t-malv-09-3:nlp-introduction.avi|AVI}} | Hannes |
|1|mon 14/09| {{:public:t-malv-09-3:corpora.pdf|Corpora}} | Chapter 2.1 | {{:public:t-malv-09-3:nlp-corpora.avi|AVI}} | Hrafn |
|1|mon 14/09| {{:public:t-malv-09-3:finitestate.pdf|Finite-state automata}} | Chapter 2.2 | {{:public:t-malv-09-3:nlp-fsa.avi|AVI}} | Hrafn |
|2|thu 17/09| {{:public:t-malv-09-3:regex.pdf|Regular expressions}} | Chapters 2.3-2.4 | {{:public:t-malv-09-3:nlp-regex.avi|AVI}} {{:public:t-malv-09-3:nlp-regex-grep.avi|AVI}} | Hrafn |
|2|mon 21/09| {{:public:t-malv-09-3:perl.pdf|Perl}} | http://www.ebb.org/PickingUpPerl/pickingUpPerl.pdf | {{:public:t-malv-09-3:nlp-perl1.avi|AVI}} {{:public:t-malv-09-3:nlp-perl2.avi|AVI}} | Hrafn |
|3|thu 24/09| {{:public:t-malv-09-3:tokenisation.pdf|Tokenisation}}| Chapters 4.1-4.3. Chapter 2 in "Handbook of NLP". [[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.8947]] | {{:public:t-malv-09-3:nlp-tokenisation.avi|AVI}} {{:public:t-malv-09-3:nlp-jflex.avi|AVI}}| Hrafn |
|3|mon 28/09| {{:public:t-malv-09-3:n-grams.pdf|Word counting and n-grams}} | Chapters 4.4-4.7| {{:public:t-malv-09-3:nlp-ngrams.avi|AVI}} {{:public:t-malv-09-3:nlp-probmodels.avi|AVI}} | Hrafn |
|4|thu 01/10| {{:public:t-malv-09-3:morphology.pdf|Morphology}}| Chapter 5 | {{:public:t-malv-09-3:nlp-pos.avi|AVI}} {{:public:t-malv-09-3:nlp-morpho.avi|AVI}} | Hrafn |
|4|mon 05/10| {{:public:t-malv-09-3:lexc.pdf|Lexicon compiler}}| | {{:public:t-malv-09-3:nlp-lexc.avi|AVI}} | Hrafn |
|4|mon 05/10| **Student lecture**: Constructing lexical transducers| [[http://citeseer.ist.psu.edu/443780.html]] | | **Grímur** |
|5|thu 08/10| {{:public:t-malv-09-3:tagging_rules.pdf|POS tagging - with rules}} | Chapters 6.1-6.3 | {{:public:t-malv-09-3:nlp-pos-tagging.avi|AVI}} {{:public:t-malv-09-3:nlp-pos-tagging-rules.avi|AVI}} | Hrafn |
|5|thu 08/10| **Student lecture**: Tagging Icelandic text: A linguistic rule-based approach | [[http://nlp.ru.is/publications.htm]] | | **Tihomir** |
|5|mon 12/10| {{:public:t-malv-09-3:tagging_statistics.pdf|POS tagging - with statistics}}| Chapters 7.1, 7.2.1-7.2.2 | {{:public:t-malv-09-3:nlp-pos-tagging-stat.avi|AVI}} {{:public:t-malv-09-3:nlp-pos-tagging-stat2.avi|AVI}} | Hrafn |
|6|thu 15/10| {{:public:t-malv-09-3:syntax.pdf|Syntax analysis}}| Chapters 9.1-9.4 and 9.7 in "Speech and Language Processing" | {{:public:t-malv-09-3:nlp-syntax1.avi|AVI}} {{:public:t-malv-09-3:nlp-syntax2.avi|AVI}} | Hrafn |
|6|thu 15/10| **Student lecture**: A simple rule-based part of speech tagger | [[http://portal.acm.org/citation.cfm?id=974526]] | | **Emanuele** |
|6|mon 19/10| {{:public:t-malv-09-3:cfg_and_prolog.pdf|Context-free grammar and Prolog}} | Chapters 8.1-8.4 | {{:public:t-malv-09-3:nlp-cfg-prolog.avi|AVI}} | Hrafn |
|7|thu 22/10| **Student lecture**: Statistical Identification of Language | [[http://eprints.kfupm.edu.sa/66788/]] | | **Þór** |
|7|thu 22/10| {{:public:t-malv-09-3:partialparsing.pdf|Partial parsing}} | Chapters 9.1, 9.3-9.4, 9.6, 9.9| {{:public:t-malv-09-3:nlp-partial-parsing.avi|AVI}} | Hrafn |
|7|mon 26/10| {{:public:t-malv-09-3:iceparser.pdf|Partial parsing}} | IceParser: [[http://dspace.utlib.ee/dspace/handle/10062/2563]] | No AVI file | Hrafn |
|7|mon 26/10| **Student lecture**: Text Chunking using Transformation-Based Learning | [[http://acl.ldc.upenn.edu/W/W95/W95-0107.pdf]] | | **Jeppe** |
|8|thu 29/10| {{:public:t-malv-09-3:parsingtechniques.pdf|Parsing techniques}} | Chapters 11.1-11.4, 11.5.0| No AVI file| Hrafn |
|8|mon 02/11| {{:public:t-malv-09-3:semantics.pdf|Semantics and predicate logic}} | Chapters 8.7, 12.1-12.9 | No AVI file | Hrafn |
|9|thu 05/11| **Guest lecture**: {{:public:t-malv-09-3:lexsemmalvinnsla.pdf|Lexical semantics}}| Chapters 13.1-13.5 | | Matthew Whelpton |
|9|mon 09/11| {{:public:t-malv-09-3:discourse.pdf|Discourse and reference resolution}}| Chapters 14.1-14.5, 14.7 (skip 14.7.4) + (Brown and Yule 1983) {{:public:t-malv-09-3:brown_yule_1983_chapter1.pdf|Sec. 1.1, 1.3}} | {{:public:t-malv-09-3:nlp-discourse.avi|AVI}} | Hannes|
|10|thu 12/11| {{:public:t-malv-09-3:information.pdf|Information structure and newness of information}} | (Brown and Yule 1983) {{:public:t-malv-09-3:brown_yule_1983_chapter4.pdf|Sec. 4.1-4.2}} + ([[http://www.ling.upenn.edu/~ellen/givennew.pdf|Prince 1981]]) Sec. 1-3 | {{:public:t-malv-09-3:nlp-information.avi|AVI}} | Hannes |
|10|mon 16/11| {{:public:t-malv-09-3:discstruct.pdf|Discourse structure and discourse markers}} | Chapters 14.6, 14.8 + (Allen 1995) {{:public:t-malv-09-3:allen_1995_chapter16.pdf|16.1-16.3}} | {{:public:t-malv-09-3:nlp-discstruct.avi|AVI}} | Hannes|
|11|thu 19/11| {{:public:t-malv-09-3:grounding.pdf|Adjacency pairs, speech acts and grounding in dialogue}} | Chapter 15 | {{:public:t-malv-09-3:nlp-grounding.avi|AVI}} | Hannes |
|11|mon 23/11| {{:public:t-malv-09-3:nonverbal.pdf|The role of non-verbal behaviour in communication}} | ({{:public:t-malv-09-3:hicss2005.pdf|Vilhjálmsson 2005}}) | {{:public:t-malv-09-3:nlp-nonverbal.avi|AVI}} | Hannes|
|12|thu 26/11| Embodied Conversational Agents | ({{:public:t-malv-09-3:kbs2001.pdf|Cassell et al. 2001}}) | | Hannes |
|12|mon 11/12| {{:public:t-malv-09-3:review2.pdf|Review for final exam}} | | | Hannes and Hrafn|
^^^^^^

===== A selection of papers =====

^Topic^Title^Link^
|N-grams| Statistical Identification of Language | [[http://eprints.kfupm.edu.sa/66788/]] |
|N-grams| N-Gram-Based Text Categorization | [[http://citeseer.ist.psu.edu/68861.html]] |
|N-grams| A Mixed Trigrams Approach for Context Sensitive Spell Checking | [[http://nlp.cs.uic.edu/PS-papers/spell-cicling07.pdf]] |
|Morphology | Applications of Finite-State Transducers in Natural Language Processing | [[http://www2.parc.com/istl/members/karttune/publications/ciaa-2000/fst-in-nlp.pdf]] |
|Morphology | Constructing Lexical Transducers | [[http://citeseer.ist.psu.edu/443780.html]] |
|Morphology | Guessing Morphological Classes of Unknown German Nouns | [[http://nats-www.informatik.uni-hamburg.de/~vhahn/Downloads/RANLP03.pdf]] |
|Morphology | Automatic Rule Induction for Unknown Word Guessing | [[http://portal.acm.org/citation.cfm?id=972708]] |
|Morphology | A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI) | [[http://www.springerlink.com/content/h530q7157285563u/]] |
|POS tagging | Tagging Icelandic text: an experiment with integrations and combinations of taggers | [[http://nlp.ru.is/publications.htm]] |
|POS tagging | Tagging Icelandic text: A linguistic rule-based approach | [[http://nlp.ru.is/publications.htm]] |
|POS tagging | TnT - A Statistical Part-of-Speech Tagger | [[http://citeseer.ist.psu.edu/brants00tnt.html]] |
|POS tagging | Comparing a Linguistic and a Stochastic Tagger | [[http://acl.ldc.upenn.edu/P/P97/P97-1032.pdf]] |
|POS tagging | A simple rule-based part of speech tagger | [[http://portal.acm.org/citation.cfm?id=974526]] |
|POS tagging | POS Tagging for German: How Important is the Right Context? | [[http://www.lrec-conf.org/proceedings/lrec2008/pdf/253_paper.pdf]] |
|Parsing | Treebank Grammars | [[http://www.nlp.org.cn/docs/docredirect.php?doc_id=25]] |
|Parsing | Exploring Evidence for Shallow Parsing | [[http://acl.ldc.upenn.edu/W/W01/W01-0706.pdf]] |
|Parsing | Text Chunking using Transformation-Based Learning | [[http://acl.ldc.upenn.edu/W/W95/W95-0107.pdf]] |
|Parsing | IceParser: An Incremental Finite-State Parser for Icelandic | [[http://nlp.ru.is/publications.htm]] |
|Parsing | Statistical Techniques for Natural Language Parsing | [[http://citeseer.ist.psu.edu/286958.html]] |
|Discourse and Dialogue | Augmenting Online Conversation through Automatic Discourse Tagging | [[http://www.ru.is/faculty/hannes/publications/HICSS2005.pdf]] |
|Discourse and Dialogue | Generating Dialogues Between Virtual Agents Automatically from Text | [[http://www.springerlink.com/index/p6265q6h81312001.pdf]] |
|Discourse and Dialogue | More Than Just a Pretty Face: Conversational Protocols and the Affordances of Embodiment | [[http://www.ru.is/faculty/hannes/publications/KBS2001.pdf]] |
|Discourse and Dialogue | Towards a model of face-to-face grounding | [[http://www.springerlink.com/index/p6265q6h81312001.pdf]] |
|Discourse and Dialogue | Building Effective Question Answering Characters | [[http://www.aclweb.org/anthology-new/W/W06/W06-1303.pdf]] |
|Discourse and Dialogue | Semantic and Discourse Information for Text-to-Speech Intonation | [[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.4835]] |

===== Other material =====

^Title^
| {{:public:t-malv-09-3:icelandictagset.pdf|The Icelandic tagset}} |
| The Penn Treebank tagset: [[http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html]]|
| Icelandic Grammar: [[http://en.wikipedia.org/wiki/Icelandic_grammar]]|
| The Database of Icelandic Inflections: [[http://bin.arnastofnun.is]]|
| Icelandic PoS tagging Demo (IceNLP): [[http://nlp.ru.is/]] |
| English PoS tagging Demo (Penn Treebank tagset): [[http://l2r.cs.uiuc.edu/~cogcomp/pos_demo.php]] |
| English Constraint Grammar Demo: [[http://www2.lingsoft.fi/cgi-bin/engcg]] |
| Constraint Grammar Development: [[http://visl.sdu.dk/constraint_grammar.html]] |
| English Partial Parsing Demo: [[http://l2r.cs.uiuc.edu/~cogcomp/demo.php?dkey=SP]] |
| Dialog Act Coding Schemes: [[http://www.dfki.de/mate/d11/chap4.html]] |
| DAMSL Dialog Act Markup Scheme: [[http://www.cs.rochester.edu/research/cisd/resources/damsl/RevisedManual/RevisedManual.html]] |
| Dialog Corpora Database: [[http://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm]] |
| The CSLU Spoken Language System Toolkit: [[http://www.cslu.ogi.edu/toolkit/]] |
| CADIA BML Realizer: [[http://cadia.ru.is/projects/bmlr/]] |

===== Course assessment =====

^Part of Course^Total Weight^
|Programming Project | 40%|
|Participation | 15%|
|Homework Assignments | 15%|
|Final Written Exam | 30%|
^ Total 100% ^^
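
The weights in the course assessment table can be combined into a final grade as a plain weighted sum. The sketch below is only an illustration: the 0-10 grading scale, the part names, the helper ''final_grade'' and the sample scores are assumptions, and it presumes each part has already been reduced to a single score.

```python
# Weights copied from the course assessment table (fractions of the final grade).
WEIGHTS = {
    "programming_projects": 0.40,
    "participation": 0.15,
    "homework_assignments": 0.15,
    "final_written_exam": 0.30,
}


def final_grade(scores):
    """Return the weighted sum of the part scores.

    Assumption: every score is on the same scale (0-10 here) and each part
    has already been averaged into one number.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the four course parts")
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "programming_projects": 8.0,
    "participation": 9.0,
    "homework_assignments": 7.0,
    "final_written_exam": 6.5,
}
print(round(final_grade(example), 2))
```

Because the four weights add up to 100%, a student with the same score in every part gets that score as the final grade.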