San Diego State University
Statistical methods in computational linguistics

Linguistics 681

 

Spring 2005
MW 7:00–8:15
Room BA-412

This course offers an introduction to statistical methods in computational linguistics. Through a combination of lectures, demonstrations, and hands-on exercises, students will develop the skills necessary for constructing statistical natural language processing applications and for evaluating their results. Topics to be covered include:

  • basic probability and information theory
  • statistics for corpus analysis and hypothesis testing
  • Markov chains and sequence models
  • probabilistic context-free grammars
  • stochastic attribute value grammars

Pre-/co-requisite: Ling 581 or equivalent (some experience with Python or a similar scripting language will be helpful)

Instructor

Rob Malouf
Office: BA 310A
Office Hours: Mondays 1:30–3:00, or by appointment
Email: rmalouf@mail.sdsu.edu
Phone: (619) 594-7111

Requirements

The final grade will be based on homework assignments (20%), a take-home midterm exam (30%), and a final project (50%).

Throughout the term, there will be occasional homework assignments for practicing the techniques learned in class. Working in groups is encouraged, but please include the names of all collaborators on the assignment.

The final project for this course will be a group project to design, implement, document, and evaluate an NLP application based on the statistical methods covered in the course. The details will depend on the interests of the students, but one possible project would be a system that can correctly answer "fill-in-the-blank" vocabulary questions from exams like the SAT or GRE.
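As a rough illustration of what such a fill-in-the-blank system might look like at its simplest, the sketch below scores each candidate word by how well it fits between its two neighbors under an add-one smoothed bigram model. It is only a sketch under stated assumptions, not a required or recommended design: the function names, the toy corpus, and the `____` blank marker are all hypothetical, and a real project would need a much larger corpus and a more careful model.

```python
from collections import Counter

BLANK = "____"  # hypothetical marker for the missing word


def train_bigrams(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev).

    The vocabulary size used here includes the boundary markers,
    which is a simplification.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def fill_blank(question, candidates, unigrams, bigrams):
    """Return the candidate that best fits between the blank's neighbors."""
    tokens = ["<s>"] + question.lower().split() + ["</s>"]
    i = tokens.index(BLANK)  # assumes exactly one whitespace-separated blank

    def score(candidate):
        word = candidate.lower()
        left = bigram_prob(tokens[i - 1], word, unigrams, bigrams)
        right = bigram_prob(word, tokens[i + 1], unigrams, bigrams)
        return left * right

    return max(candidates, key=score)


# Toy training data, invented purely for illustration.
corpus = [
    "the storm caused severe damage",
    "the doctor gave a mild warning",
    "severe damage was reported",
    "a mild breeze cooled the room",
]
unigrams, bigrams = train_bigrams(corpus)
answer = fill_blank("the flood caused ____ damage", ["mild", "severe"],
                    unigrams, bigrams)
print(answer)  # prints "severe"
```

Evaluating a system like this would mean measuring its accuracy on held-out questions and comparing it against a baseline such as always choosing the most frequent candidate.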

Readings

The required textbooks for this course are:

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Eugene Charniak. 1993. Statistical Language Learning. MIT Press.

They are for sale in the campus bookstore and from online retailers such as Amazon. Updates and corrections to the first book can be downloaded from the authors' website.

Additional readings will be made available in class or via the "Resources" section of the course web page.

Schedule

Week 1–4 Introduction (Chapters 1, 2)
Background · Mathematical background · Probability · Information Theory

Week 5–8 Statistics (Chapters 3, 4, 5)
Descriptive statistics · Graphical methods · Hypothesis testing · Paired sign test · Bootstrap · Corpus statistics

Week 9–10 Sequence models (Chapters 6, 9, 10)
N-grams · Smoothing · Hidden Markov models · Viterbi decoding · Part-of-speech tagging

Week 11–14 Parsing (Chapters 11, 12)
Probabilistic context-free grammars · Inside-Outside algorithm · Treebank grammars · Dependency-based models

Week 15 Class projects

Final project
Project due May 18

Resources

Last modified: Tue May 24 13:14:43 PDT 2005