
T-538-MALV and T-725-MALV, Natural Language Processing Fall 2009

Basic Info

Description

The goal of language technology (LT) is to develop systems which allow people to communicate with computers using natural languages. LT is an interdisciplinary field, requiring knowledge from subjects like linguistics, statistics, psychology, engineering and computer science. This course discusses fundamentals of natural language processing (NLP), which is one of the subfields of LT, and introduces research in the field. Students acquire understanding of the various stages of NLP, e.g. morphological analysis, part-of-speech tagging, syntactic analysis, semantic analysis, discourse and dialogue. In the course, students work on programming projects related to the aforementioned stages.
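As a rough illustration of what the earliest of these stages can look like in practice, the following minimal sketch tokenizes a short text with a regular expression and counts adjacent word pairs (bigrams). It is not course material: the function names `tokenize` and `bigrams` and the sample sentence are hypothetical, and the tokenization rule is deliberately simplistic.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical sample text, chosen only for illustration.
sentence = "The cat sat on the mat. The cat slept."
tokens = tokenize(sentence)
counts = Counter(bigrams(tokens))

print(tokens)
print(counts.most_common(1))  # the most frequent bigram and its count
```

Later stages such as part-of-speech tagging or parsing would take the token list produced here as their input.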

Goals

The course objectives are that students:

Coursework Overview

To provide a rich hands-on experience, students will build their own application that relies on NLP over the course of the semester. While the choice of final application is completely in the hands of the students, important NLP components will be built in a series of programming projects that lead up to the final system demonstration. Three homework assignments will also be distributed during the semester to reinforce some of the more theoretical material.

Everything that has to be turned in, including the programming projects, should arrive no later than 23:59 on the due date, or else incur a 10% penalty for each additional day, including weekends and holidays. Projects are not accepted if handed in more than two days late.
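The late policy above can be sketched as a small calculation. This is one reading of the rule, not an official grading script: it assumes that any started day after the deadline counts as a full day, that the deadline is 23:59:00 on the due date, and that the two-day cutoff applies to every submission. The function names are hypothetical.

```python
import math
from datetime import datetime, timedelta


def days_late(due_date, submitted):
    """Number of started days past 23:59 on the due date (0 if on time)."""
    deadline = due_date.replace(hour=23, minute=59, second=0, microsecond=0)
    if submitted <= deadline:
        return 0
    # Assumption: a partial day counts as a whole additional day.
    return math.ceil((submitted - deadline) / timedelta(days=1))


def late_penalty_percent(n_days):
    """Penalty in percent, or None when the submission is no longer accepted."""
    if n_days > 2:
        return None  # more than two days late: not accepted
    return 10 * n_days


# Example: a project due on 8 October 2009, submitted the next morning.
due = datetime(2009, 10, 8)
late = days_late(due, datetime(2009, 10, 9, 10, 0))
print(late, late_penalty_percent(late))
```

Under these assumptions, a submission the morning after the due date counts as one day late and incurs a 10% penalty.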

General participation in class discussions counts towards a special participation grade. In addition, M.Sc. level students need to prepare and give one presentation on an existing research paper in the field, which also counts towards their participation grade.

Assignments and Projects

Assignment | Code | Description | Assigned | Due | Duration | Weight | Data
Homework Assignment 1 | A1 | Regular expressions | M 21. Sep | M 28. Sep | 1 week | 5% |
Programming Project 1 | P1 | Tokenizing text | M 28. Sep | T 8. Oct | 10 days | 8% - 10% | Data
Homework Assignment 2 | A2 | Tagging | M 12. Oct | M 19. Oct | 1 week | 5% |
Programming Project 2 | P2 | Tagging text | M 19. Oct | T 29. Oct | 10 days | 8% - 10% | http://www.ru.is/faculty/hrafn/Data/eng.zip
Programming Project 2 | P2 | Tagging text - Darmstadt | F 27. Nov | F 11. Dec | 14 days | | http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
Programming Project 3 | P3 | Parsing Text: Icelandic or English | T 29. Oct | M 09. Nov | 10 days | 8% |
Homework Assignment 3 | A3 | Discourse analysis | M 9. Nov | M 16. Nov | 1 week | 5% |
Programming Project 4 | P4 | Application | M 16. Nov | W 2. Dec | 16 days | 16% - 20% |
Total | | | | | | 55% |

Final Exam

There will be a final written exam counting 30% towards your grade.

Online Discussion Forum

The course has an online discussion forum that we can use in any way we see fit. Note that students have to register on the forum before they can post (simply go to the address below to register).

Host | Forum Name | Location
ProBoards | NLP2009 | http://ruclasses.proboards.com/index.cgi?board=nlp2009

Lectures

Lectures are recorded outside the classroom using the Camtasia program (for Windows). If you have problems listening to or viewing the resulting .AVI files and you are running Windows or Mac, you probably need the appropriate codec from http://www.techsmith.com/download/codecs.asp. If you are using Ubuntu Linux, install a codec with sudo apt-get install ffmpeg and then use mplayer to view the .AVI files.

Week | Date | Topic | Material | Lecture | Who
1 | thu 10/09 | About the course | | AVI | Hannes
1 | thu 10/09 | Introduction | Chapter 1 | AVI | Hannes
1 | mon 14/09 | Corpora | Chapter 2.1 | AVI | Hrafn
1 | mon 14/09 | Finite-state automata | Chapter 2.2 | AVI | Hrafn
2 | thu 17/09 | Regular expressions | Chapters 2.3-2.4 | AVI AVI | Hrafn
2 | mon 21/09 | Perl | http://www.ebb.org/PickingUpPerl/pickingUpPerl.pdf | AVI AVI | Hrafn
3 | thu 24/09 | Tokenisation | Chapters 4.1-4.3. Chapter 2 in “Handbook of NLP”. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.8947 | AVI AVI | Hrafn
3 | mon 28/09 | Word counting and n-grams | Chapters 4.4-4.7 | AVI AVI | Hrafn
4 | thu 01/10 | Morphology | Chapter 5 | AVI AVI | Hrafn
4 | mon 05/10 | Lexicon compiler | | AVI | Hrafn
4 | mon 05/10 | Student lecture: Constructing lexical transducers | http://citeseer.ist.psu.edu/443780.html | | Grímur
5 | thu 08/10 | POS tagging - with rules | Chapters 6.1-6.3 | AVI AVI | Hrafn
5 | thu 08/10 | Student lecture: Tagging Icelandic text: A linguistic rule-based approach | http://nlp.ru.is/publications.htm | | Tihomir
5 | mon 12/10 | POS tagging - with statistics | Chapters 7.1, 7.2.1-7.2.2 | AVI AVI | Hrafn
6 | thu 15/10 | Syntax analysis | Chapters 9.1-9.4 and 9.7 in “Speech and Language Processing” | AVI AVI | Hrafn
6 | thu 15/10 | Student lecture: A simple rule-based part of speech tagger | http://portal.acm.org/citation.cfm?id=974526 | | Emanuele
6 | mon 19/10 | Context-free grammar and Prolog | Chapters 8.1-8.4 | AVI | Hrafn
7 | thu 22/10 | Student lecture: Statistical Identification of Language | http://eprints.kfupm.edu.sa/66788/ | | Þór
7 | thu 22/10 | Partial parsing | Chapters 9.1, 9.3-9.4, 9.6, 9.9 | AVI | Hrafn
7 | mon 26/10 | Partial parsing | IceParser: http://dspace.utlib.ee/dspace/handle/10062/2563 | No AVI file | Hrafn
7 | mon 26/10 | Student lecture: Text Chunking using Transformation-Based Learning | http://acl.ldc.upenn.edu/W/W95/W95-0107.pdf | | Jeppe
8 | thu 29/10 | Parsing techniques | Chapters 11.1-11.4, 11.5.0 | No AVI file | Hrafn
8 | mon 02/11 | Semantics and predicate logic | Chapters 8.7, 12.1-12.9 | No AVI file | Hrafn
9 | thu 05/11 | Guest lecture: Lexical semantics | Chapters 13.1-13.5 | | Matthew Whelpton
9 | mon 09/11 | Discourse and reference resolution | Chapters 14.1-14.5, 14.7 (skip 14.7.4) + (Brown and Yule 1983) Sec. 1.1, 1.3 | AVI | Hannes
10 | thu 12/11 | Information structure and newness of information | (Brown and Yule 1983) Sec. 4.1-4.2 + (Prince 1981) Sec. 1-3 | AVI | Hannes
10 | mon 16/11 | Discourse structure and discourse markers | Chapters 14.6, 14.8 + (Allen 1995) 16.1-16.3 | AVI | Hannes
11 | thu 19/11 | Adjacency pairs, speech acts and grounding in dialogue | Chapter 15 | AVI | Hannes
11 | mon 23/11 | The role of non-verbal behaviour in communication | (Vilhjálmsson 2005) | AVI | Hannes
12 | thu 26/11 | Embodied Conversational Agents | (Cassell et al. 2001) | | Hannes
12 | mon 11/12 | Review for final exam | | | Hannes and Hrafn

A selection of papers

Topic | Title | Link
N-grams | Statistical Identification of Language | http://eprints.kfupm.edu.sa/66788/
N-grams | N-Gram-Based Text Categorization | http://citeseer.ist.psu.edu/68861.html
N-grams | A Mixed Trigrams Approach for Context Sensitive Spell Checking | http://nlp.cs.uic.edu/PS-papers/spell-cicling07.pdf
Morphology | Applications of Finite-State Transducers in Natural Language Processing | http://www2.parc.com/istl/members/karttune/publications/ciaa-2000/fst-in-nlp.pdf
Morphology | Constructing Lexical Transducers | http://citeseer.ist.psu.edu/443780.html
Morphology | Guessing Morphological Classes of Unknown German Nouns | http://nats-www.informatik.uni-hamburg.de/~vhahn/Downloads/RANLP03.pdf
Morphology | Automatic Rule Induction for Unknown Word Guessing | http://portal.acm.org/citation.cfm?id=972708
Morphology | A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI) | http://www.springerlink.com/content/h530q7157285563u/
POS tagging | Tagging Icelandic text: an experiment with integrations and combinations of taggers | http://nlp.ru.is/publications.htm
POS tagging | Tagging Icelandic text: A linguistic rule-based approach | http://nlp.ru.is/publications.htm
POS tagging | TnT - A Statistical Part-of-Speech Tagger | http://citeseer.ist.psu.edu/brants00tnt.html
POS tagging | Comparing a Linguistic and a Stochastic Tagger | http://acl.ldc.upenn.edu/P/P97/P97-1032.pdf
POS tagging | A simple rule-based part of speech tagger | http://portal.acm.org/citation.cfm?id=974526
POS tagging | POS Tagging for German: How Important is the Right Context? | http://www.lrec-conf.org/proceedings/lrec2008/pdf/253_paper.pdf
Parsing | Treebank Grammars | http://www.nlp.org.cn/docs/docredirect.php?doc_id=25
Parsing | Exploring Evidence for Shallow Parsing | http://acl.ldc.upenn.edu/W/W01/W01-0706.pdf
Parsing | Text Chunking using Transformation-Based Learning | http://acl.ldc.upenn.edu/W/W95/W95-0107.pdf
Parsing | IceParser: An Incremental Finite-State Parser for Icelandic | http://nlp.ru.is/publications.htm
Parsing | Statistical Techniques for Natural Language Parsing | http://citeseer.ist.psu.edu/286958.html
Discourse and Dialogue | Augmenting Online Conversation through Automatic Discourse Tagging | http://www.ru.is/faculty/hannes/publications/HICSS2005.pdf
Discourse and Dialogue | Generating Dialogues Between Virtual Agents Automatically from Text | http://www.springerlink.com/index/p6265q6h81312001.pdf
Discourse and Dialogue | More Than Just a Pretty Face: Conversational Protocols and the Affordances of Embodiment | http://www.ru.is/faculty/hannes/publications/KBS2001.pdf
Discourse and Dialogue | Towards a model of face-to-face grounding | http://www.springerlink.com/index/p6265q6h81312001.pdf
Discourse and Dialogue | Building Effective Question and Answering Characters | http://www.aclweb.org/anthology-new/W/W06/W06-1303.pdf
Discourse and Dialogue | Semantic and Discourse Information for Text-to-Speech Intonation | http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.4835

Other material

Title
The Icelandic tagset
The Penn Treebank tagset: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Icelandic Grammar: http://en.wikipedia.org/wiki/Icelandic_grammar
The Database of Icelandic Inflections: http://bin.arnastofnun.is
Icelandic PoS tagging Demo (IceNLP): http://nlp.ru.is/
English PoS tagging Demo (Penn Treebank tagset): http://l2r.cs.uiuc.edu/~cogcomp/pos_demo.php
English Constraint Grammar Demo: http://www2.lingsoft.fi/cgi-bin/engcg
Constraint Grammar Development: http://visl.sdu.dk/constraint_grammar.html
English Partial Parsing Demo: http://l2r.cs.uiuc.edu/~cogcomp/demo.php?dkey=SP
Dialog Act Coding Schemes: http://www.dfki.de/mate/d11/chap4.html
DAMSL Dialog Act Markup Scheme: http://www.cs.rochester.edu/research/cisd/resources/damsl/RevisedManual/RevisedManual.html
Dialog Corpora Database: http://www-rcf.usc.edu/~billmann/diversity/DDivers-site.htm
The CSLU Spoken Language System Toolkit: http://www.cslu.ogi.edu/toolkit/
CADIA BML Realizer: http://cadia.ru.is/projects/bmlr/

Course assessment

Part of Course | Total Weight
Programming Project | 40%
Participation | 15%
Homework Assignments | 15%
Final Written Exam | 30%
Total | 100%
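To see how these weights combine, here is a minimal sketch of the weighted sum. The weights are taken from the table above; the 0-10 grading scale, the example scores and the function name `final_grade` are assumptions made only for illustration.

```python
# Weights in percent, as listed in the course assessment table.
WEIGHTS = {
    "Programming Project": 40,
    "Participation": 15,
    "Homework Assignments": 15,
    "Final Written Exam": 30,
}


def final_grade(scores):
    """Weighted average of component scores (assumed to be on a 0-10 scale)."""
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS) / 100


# Hypothetical scores for one student.
example = {
    "Programming Project": 8,
    "Participation": 10,
    "Homework Assignments": 9,
    "Final Written Exam": 7,
}
print(final_grade(example))  # (320 + 150 + 135 + 210) / 100 = 8.15
```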