San Diego State University
Computational Corpus Linguistics

Linguistics 571

 

Fall 2005
MW 2:00–3:15
Room BA-412

Advances in technology have revolutionized the way linguists approach their data. With computers, extremely large bodies of text ("corpora") can be collected and analyzed at a level of detail that would have been unthinkable only a generation ago. For linguists and computer scientists alike, the accelerating growth of the World Wide Web and other natural language resources has made techniques for dealing with very large texts more important than ever.

Through a combination of lectures, demonstrations, and hands-on exercises, this course will give students an introduction to the skills necessary for computer-aided text manipulation. Students will learn to construct and search text databases using Unix tools, to write Python programs to manipulate large natural language corpora, and to use statistical software to perform quantitative analysis of linguistic data.
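
As a rough illustration of the kind of text manipulation described above, the following sketch tokenizes a string with a regular expression and counts word frequencies. It is not taken from the course materials: the token pattern, the function names tokenize and word_frequencies, and the sample sentence are all assumptions made for this example, and it is written for current Python 3 rather than the Python version in use in 2005.

```python
import re
from collections import Counter

# Assumption: a token is a run of ASCII letters, optionally with internal
# apostrophes (so "didn't" is one token). Real corpora usually need more care.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def tokenize(text):
    """Return the lowercased word tokens found in text."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


def word_frequencies(text):
    """Count how often each token occurs in text."""
    return Counter(tokenize(text))


if __name__ == "__main__":
    sample = "The cat sat on the mat. The mat didn't mind."
    # Print the three most frequent tokens with their counts.
    for word, count in word_frequencies(sample).most_common(3):
        print(word, count)
```

To count words in a file instead, the file's contents could be read into a string and passed to word_frequencies. For very large corpora, reading line by line and updating a single Counter would likely be a better fit.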

Instructor

Rob Malouf
Office: BA 310A
Office Hours: Mondays and Tuesdays 1:00-2:00, or by appointment
Email: rmalouf@mail.sdsu.edu
Phone: (619) 594-7111

Requirements

The final grade will be based on homework assignments (30%), a midterm project (30%), and a final project (40%).

Throughout the term, there will be five hands-on homework assignments in which students apply the techniques learned in class to actual corpus materials. Since it's important not to fall behind, late assignments will be accepted for partial credit only up to one week after the due date, unless prior arrangements are made. Working in groups is encouraged, but please include the names of all coworkers on the assignment.

The final project should be a program (with documentation) to perform some substantial corpus processing task. Alternatively, the final project can be the collection and annotation of a new corpus. More details about both projects will be given later in the term.

Readings

There are two required textbooks for this course:

Alan Gauld. 2001. Learn to Program Using Python. Addison Wesley.

and

Jon Lasser. 2000. Think Unix. Pearson Education.

Both of these books are available in the campus bookstore. In addition, you might find it useful to have a comprehensive Python reference manual, such as:

David M. Beazley. 2001. Python Essential Reference. Second Edition. New Riders.

This should be easy to find at local or on-line bookstores.

Additional readings will be made available in class or via the "Resources" section of the course web page.

Schedule

Week 1–2 Introduction
Background · Why corpus linguistics? · What is a corpus? · Corpus types · Constructing corpora

Week 3–5 Text manipulation with Unix
Computational linguistics lab · Introduction to Unix · Unix text tools · Counting words · Regular expressions · Tokenization · Representing corpora · Markup languages

Week 6–11 Python
What is Python? · Basic Python programming · Tokenization revisited · Python data structures · Stemming · Tagging

Week 12–14 Quantitative linguistics
Quantitative data analysis · Collocations and idioms · Text types and genre (Prof. Csomay)

Week 15 Future prospects
Very very large corpora · World Wide Web as a corpus · Bioinformatics · Computational linguistics
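
To give a flavor of the collocation topic listed in the schedule, here is a minimal sketch that scores adjacent word pairs by pointwise mutual information (PMI). The syllabus does not say which association measure the course uses, so PMI is an assumption chosen because it is a common one. The function name bigram_pmi, the min_count threshold, and the toy token list are likewise hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated from relative frequencies in the token list.
    Pairs seen fewer than min_count times are skipped, since PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    tokens = "new york is big and new york is old".split()
    # Print candidate collocations, highest PMI first.
    for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than its individual word frequencies would predict. On a corpus this small the numbers mean little; the sketch only shows the mechanics.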

Lectures

Aug 31: Introduction; slides, handout
Sep 7: Corpus types; slides, handout
Sep 12: Unix; slides, handout
Sep 14: Unix (cont.); slides, handout
Sep 19: Unix (cont.); slides, handout
Sep 21: Regular expressions; slides, handout
Sep 26: Regular expressions (cont.); slides, handout
Sep 28: Regular expressions (cont.)
Oct 3: Python; slides, handout; hello.py, hello2.py, hello3.py, ctof.py
Oct 5: Python (cont.); wc.py, types.py
Oct 10: LAB
Oct 12: LAB
Oct 17: Tokenization; slides, handout; tokenize.py
Oct 19: Unicode; slides, handout
Oct 24: Unicode (cont.); slides, handout
Oct 26: Palindromes, etc.
Oct 31: Tagging; slides, handout
Nov 2: Tagging (cont.); slides, handout; bigram.py, trigram.py, monkeys.py
Nov 7: Quantitative linguistics; slides, handout
Nov 9: Prof. Csomay
Nov 14: LAB
Nov 16: LAB
Nov 21: Quantitative linguistics (cont.); slides, handout
Nov 23: Midterm
Nov 28: Collocations; slides, handout
Nov 30: Collocations (cont.); slides, handout

Last modified: Wed Nov 30 15:46:00 PST 2005