Introduction to Computational Linguistics

Homework

    Assignment One

    Textbook Problems

    Exercises 2.1, 2.4, and 2.8 at the end of Chapter 2 of Jurafsky and Martin, pp. 53-56.

    Regular Recognition

    Describe the class of strings matched by the following regular expressions:

    1. [a-zA-Z]+
    2. [A-Z][a-z]*
    3. \d+(\.\d+)?
    4. ([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*
    5. \w+|[^\w\s]+
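
One way to explore patterns like those above is to run them against a few hand-written strings and see which ones match in full. The following is a minimal sketch, not part of the assignment: the pattern `colou?r` and the sample strings are hypothetical stand-ins, and `grep -E` is used as the equivalent of egrep. Note that `\d`, as in pattern 3, is a Perl-style shorthand that POSIX extended regular expressions do not define; under `grep -E` the portable spelling is `[0-9]`.

```shell
# Hypothetical test strings, one per line (not taken from the assignment).
samples='color
colour
colouur
colr
the color'

# -E selects extended regular expressions (the dialect egrep uses).
# -x keeps only lines that the pattern matches from start to end.
# LC_ALL=C makes character ranges behave the same on every system.
matches=$(printf '%s\n' "$samples" | LC_ALL=C grep -E -x 'colou?r')
printf '%s\n' "$matches"
```

Swapping in one of the listed expressions and a few strings of your own shows which strings the pattern accepts as a whole and which it rejects.
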
    Corpus  

    For Exercises A and B below, use the Wall Street Journal (WSJ) as your corpus.

    This is located at:

      /opt/corpora/nanc/wsj/
    There are three subdirectories containing the articles for particular years:
    • 1994
    • 1995
    • 1996
    Compression

    The files in the nanc wsj corpus are compressed, so you need to uncompress them before passing them to egrep. For instance, to search for instances of the word choleric in the 1995 wsj corpus, you would cd to
    /opt/corpora/nanc/wsj/1995
    
    and execute:
    gunzip -c * | egrep choleric
    
    gunzip -c [filename] uncompresses a file and writes the result to standard output, where egrep can read it through the pipe. The compressed file itself is left unchanged.
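
The pipeline can be tried out without touching the real corpus. The sketch below is only an illustration: it builds a small compressed file in a temporary directory (the file name `sample.gz` and its two sentences are made up) and searches it the same way, with `grep -E` standing in for egrep.

```shell
# Build a tiny stand-in for a compressed corpus file (hypothetical contents).
tmpdir=$(mktemp -d)
printf 'A choleric editor.\nA calm reporter.\n' | gzip > "$tmpdir/sample.gz"

# gunzip -c writes the uncompressed text to standard output;
# grep -E reads that text from the pipe and prints the matching lines.
hits=$(gunzip -c "$tmpdir"/*.gz | grep -E 'choleric')
printf '%s\n' "$hits"

# Clean up the temporary directory.
rm -r "$tmpdir"
```
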
    Exercise A  

    Search for all word tokens that meet the following description:

      They contain all the English vowels (counting 'y' as a vowel) in alphabetical order.
    Your answer to this question should meet the following requirements:
    1. List the distinct words you found meeting the above description.
    2. Give the regular expression you used.
    3. Do not list the lines grep returned.
    In other words, if grep returns 500 lines of text from the WSJ corpus, give only the regular expression you used to find those lines and the distinct words that matched it.
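
Going from matching lines to a list of distinct words can be done by splitting the text into one token per line before filtering. The sketch below shows one possible approach under stated assumptions: the input sentence is invented, and the pattern `[a-z]+ing` (tokens ending in "ing") is a placeholder, not a solution to either exercise. On the real corpus, the `printf` line would be replaced by the `gunzip -c *` step.

```shell
# Hypothetical input standing in for uncompressed corpus text.
text='Trading was heavy. Heavy selling followed, analysts said, as trading resumed.'

# 1. tr -cs turns every run of non-letters into one newline (one token per line).
# 2. The second tr lowercases tokens so "Trading" and "trading" count as one word.
# 3. grep -E -x keeps tokens that the placeholder pattern matches in full.
# 4. sort -u lists each distinct word once.
words=$(printf '%s\n' "$text" \
  | tr -cs 'A-Za-z' '\n' \
  | tr 'A-Z' 'a-z' \
  | LC_ALL=C grep -E -x '[a-z]+ing' \
  | sort -u)
printf '%s\n' "$words"
```

Tokenizing first means the pattern is tested against whole words rather than whole lines, so the output is already the word list the answer asks for.
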

    Exercise B  

    Search for all instances of words that meet the following description:

      They contain all the English vowels in alphabetical order, with no other vowels interrupting the sequence. This time, don't count 'y' as a vowel. In other words, look for words that contain 'a', 'e', 'i', 'o' and 'u' in that order with only consonants between them.
    What are the words? What regular expression did you use?