Go to this web page and read the material there, paying special attention to the coding examples. Then go to the exercises at the bottom of the page and do problems 4 and 8. Here is what you should hand in, all in a single email message. You do not have to hand in paper versions of your answers in class.
Note: for the cfd plot function to work, matplotlib must be installed. This is another free Python module that is not part of the standard Python distribution, available here. If you followed the directions for installing optional NLTK packages given here, you already have matplotlib and its component parts, pylab and pyplot, installed.
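If you are not sure whether matplotlib is already installed, a quick check along these lines may help. This is only a sketch: the helper name is_installed is made up for illustration, and it only reports whether Python can locate a module, not whether plotting will actually work on your machine.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None


# Report on the two packages this assignment depends on.
for name in ("nltk", "matplotlib"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If either line says MISSING, install that package before trying the plotting examples.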
Problem 23. Zipf's Law. Turn in the two log-log graphs the exercise asks you to create. I suggest you use the Brown corpus to create the graph based on English; Brown is about 1.2 million words. Here is how to get the Brown Corpus. In Python, run:
    import nltk
    nltk.download()

This brings up a window you can interact with. There are some tabs at the top. Choose the tab labeled Corpora, select Brown, and click the Download button at the bottom of the window.

Also turn in a discussion of what you learned from this exercise. Describe in words the graph you see from the random vocabulary experiment, and say a few words explaining it. What kinds of words are the most frequent? In light of this, what does Zipf's Law really tell you about the frequency distribution of words?
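The rank-versus-frequency computation behind both graphs can be sketched roughly as follows. Treat this as one possible starting point, not the required solution. The function names, the alphabet "abcdefg ", the text length, and the fixed seed are all assumptions chosen for illustration. The plotting helper imports matplotlib only when it is called, so the rest of the code runs without it.

```python
import random
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent token."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def random_tokens(n_chars, alphabet="abcdefg ", seed=0):
    """Build random text one character at a time, then split it on spaces.

    The alphabet and seed are illustrative assumptions; the space character
    must be in the alphabet so that the text breaks into "words".
    """
    rng = random.Random(seed)
    text = "".join(rng.choice(alphabet) for _ in range(n_chars))
    return text.split()


def plot_loglog(pairs, title):
    """Plot frequency against rank on log-log axes (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it

    ranks, freqs = zip(*pairs)
    plt.loglog(ranks, freqs)
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.title(title)
    plt.show()


# Rank-frequency table for the random vocabulary experiment.
random_pairs = rank_frequency(random_tokens(200_000))
print(random_pairs[:5])
```

To produce the graphs, call plot_loglog(random_pairs, "Random text"). For the English graph, once the Brown corpus is downloaded, you could pass something like [w.lower() for w in nltk.corpus.brown.words()] to rank_frequency and plot the result the same way. Whether to lowercase or to filter out punctuation is a choice you should make and explain in your discussion.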