Linguistics 681

Statistical Methods in Computational Linguistics

Required Texts

Manning, C. and Schuetze, H. 2000. Foundations of Statistical Natural Language Processing.

Charniak, E. 1998. Statistical Language Learning. MIT Press.

Reading packet.

Course Description

This course is a survey of statistical methods in computational linguistics. It explores some of the motivations for, and alternatives to, the statistical techniques covered in Introduction to Computational Linguistics I and II. Topics include Markov chains and Hidden Markov Models, statistical estimators for n-gram models, finding collocations and subcategorization frames, collecting selectional preferences, part-of-speech tagging, word sense disambiguation, and probabilistic context-free grammars.

Grading

Assignments (40%)
Midterm (20%)
Final (40%)

Course Outline

Week 1:

Review of probability theory.

Week 2:

Introduction to Information Theory.

Week 3:

Review of n-gram models and data sparseness. Maximum likelihood estimation for n-gram models.

Week 4:

Smoothing methods. Linear interpolation. Backoff.

Week 5:

Markov chains and Hidden Markov Models (HMMs).

Week 6:

Application of HMMs to trigrams and part-of-speech tagging. Viterbi search.

Week 7:

HMM training. Forward-backward algorithm.

Weeks 8, 9, and 10:

Probabilistic context-free grammars and lexicalized probabilistic grammars.

Week 11:

Statistical alignment of bilingual corpora.

Week 12:

Authorship attribution. The case of the Federalist Papers.

Week 13:

Word clustering.

Weeks 14 and 15:

Information retrieval. Vector space model. Latent semantic indexing.