Textbook
Problems
|
 
|
Exercises 2.1, 2.4, and 2.8 at the end of Chapter 2
in Jurafsky and Martin, pp. 53-56.
|
Regular
Recognition
|
 
|
Describe the class of strings matched by the following regular expressions:
- [a-zA-Z]+
- [A-Z][a-z]*
- \d+(\.\d+)?
- ([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*
- \w+|[^\w\s]+
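One way to explore expressions like these is to try them against a handful of sample strings. The following is a minimal sketch, not part of the assignment: whole_match is a hypothetical helper, the sample strings are made up, and grep -E is used as the equivalent of egrep. Note that \d is Perl-style shorthand that POSIX extended regular expressions do not define, so the sketch spells it [0-9].

```shell
# Hypothetical helper: report whether the WHOLE string matches an extended regex.
# -E selects extended regular expressions (the syntax egrep uses),
# -x requires the entire line to match, and -q suppresses output so that
# only grep's exit status is used.
whole_match() {
    if printf '%s\n' "$2" | grep -Eqx -e "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

# \d is written as [0-9] because POSIX extended regexes do not define \d.
number='[0-9]+(\.[0-9]+)?'

# Made-up sample strings, chosen to probe the optional group.
for s in 42 3.14 3. .5 abc; do
    printf '%-5s %s\n' "$s" "$(whole_match "$number" "$s")"
done
```

Swapping in another pattern and other sample strings works the same way; the -x flag matters because without it grep reports a match whenever any substring of the line matches.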
|
|
Corpus
|
 
|
For Exercises A and B below, use the Wall Street Journal (WSJ) as your corpus.
This is located at:
There are three subdirectories containing the articles for
particular years:
|
|
Compression
|
 
|
The files in the nanc wsj corpus are
compressed, so you need to uncompress them
before passing them to egrep. Thus, to search for instances
of the word choleric in the 1995 wsj corpus, you
would cd to
/opt/corpora/nanc/wsj/1995
and execute:
gunzip -c * | egrep choleric
gunzip -c [filename] uncompresses a file
and writes the result to standard output, leaving
the compressed file on disk unchanged. The pipe (|)
feeds that output to egrep.
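The pipeline can be tried out safely away from the corpus. This is a minimal sketch under stated assumptions: the directory, file names, and sentences are invented stand-ins, and grep -E is used as the equivalent of egrep.

```shell
# Build a tiny stand-in for a compressed corpus directory (hypothetical files).
demo_dir=$(mktemp -d)
printf 'The choleric editor resigned.\nMarkets were calm.\n' > "$demo_dir/a.txt"
printf 'Nothing to see here.\n' > "$demo_dir/b.txt"
gzip "$demo_dir/a.txt" "$demo_dir/b.txt"   # leaves a.txt.gz and b.txt.gz

# gunzip -c writes the decompressed text of every file to standard output;
# grep -E then filters that stream line by line.
hits=$(gunzip -c "$demo_dir"/*.gz | grep -E 'choleric')
echo "$hits"

# grep -c counts matching lines instead of printing them.
count=$(gunzip -c "$demo_dir"/*.gz | grep -Ec 'choleric')
echo "$count matching line(s)"

# Remove the scratch directory.
rm -rf "$demo_dir"
```

Because all files are concatenated into one stream before grep sees them, the output does not say which file a matching line came from.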
|
|
Exercise A
|
 
|
Search for all word tokens that meet the following description:
They contain all the English vowels (counting 'y' as a vowel) in
alphabetical order.
Your answer to this question should meet the following requirements:
- List the distinct words you found meeting the above description.
- Give the regular expression you used.
- Do not list the lines grep returned.
In other words, if grep returns 500 lines of text from the
WSJ corpus, give only the regular expression you used
to find those 500 lines and the distinct words that matched it.
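Reducing matching lines to a list of distinct words can be done with a short pipeline. The sketch below is one possible approach, not a required method: distinct_words is a hypothetical helper, the pattern is a placeholder that deliberately does not answer the exercise, and the input text is made up. It assumes a grep that supports the -o option (GNU and BSD grep do), which prints only the matched part of each line.

```shell
# Placeholder pattern, NOT the answer to the exercise: words ending in "ly".
pattern='^[a-z]+ly$'

# Hypothetical helper: read text on standard input and print the distinct
# lower-cased words that match the extended regex given as $1.
distinct_words() {
    # 1. grep -Eo prints each alphabetic token on its own line.
    # 2. tr folds upper case to lower case so "Quickly" and "quickly" count once.
    # 3. grep -E keeps only the tokens that match the pattern.
    # 4. sort -u sorts the tokens and drops duplicates.
    grep -Eo '[A-Za-z]+' \
        | tr '[:upper:]' '[:lower:]' \
        | grep -E -e "$1" \
        | sort -u
}

words=$(printf 'Quickly, he left.\nShe quickly and quietly agreed.\n' \
    | distinct_words "$pattern")
echo "$words"
```

On the corpus, the same helper would sit at the end of the gunzip -c pipeline in place of the printf. Splitting the text into tokens first means the pattern can be anchored with ^ and $ to a single word rather than to a whole line.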
|
|
Exercise B
|
 
|
Search for all instances of words that meet the following description:
They contain all the English vowels in
alphabetical order, with no other vowels interrupting
the sequence. This time, don't
count 'y' as a vowel. In other words,
look for words that contain 'a', 'e', 'i',
'o', and 'u' in that order with only consonants between.
What are the words? What regular expression did you use?
|