Tagging assignment

Exercises from Chapter 5: 5.1, 5.2, 5.3.

Next week you will do assignment 5.6.

Preliminary help with 5.6. The session below shows a good data structure for storing word-tag counts, which you will need for exercise 5.6.

marks-macbook-pro:nltk gawron$ python
Enthought Canopy Python 2.7.3 | 64-bit | (default, Aug  8 2013, 05:37:06) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> from nltk.corpus import brown
>>> TW = brown.tagged_words()
>>> TW[:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> from collections import Counter
>>> from collections import defaultdict
>>> word_tag_dict = defaultdict(Counter)
>>> word_tag_dict['The'.lower()]['AT'] += 1
>>> word_tag_dict['FULTON'.lower()]['NP'] += 1
>>> word_tag_dict['County'.lower()]['NN'] += 1
>>> word_tag_dict['Grand'.lower()]['JJ'] += 1
>>> word_tag_dict['Jury'.lower()]['NN'] += 1
>>> 'NN-TL'.split('-')
['NN', 'TL']
>>> 'AT'.split('-')
['AT']
>>> TW[0]
('The', 'AT')
>>> word_tag_dict['fulton']['NP']
1
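
The session above updates the dictionary one word at a time. A minimal sketch of the same idea as a loop follows. The function name build_word_tag_dict and the sample list are hypothetical; the sample is copied by hand from the TW[:10] output so that the code runs without the corpus. Falling back to the full tag when the piece before the first hyphen is empty is an added assumption, not something the session shows.

```python
from collections import Counter, defaultdict

def build_word_tag_dict(tagged_words):
    """Map each lowercased word to a Counter of its base tags."""
    word_tag_dict = defaultdict(Counter)
    for word, tag in tagged_words:
        # Keep only the part before the first hyphen ('NN-TL' -> 'NN').
        # Assumption: if that part is empty, keep the whole tag instead.
        base_tag = tag.split('-')[0] or tag
        word_tag_dict[word.lower()][base_tag] += 1
    return word_tag_dict

# Hand-copied from the TW[:10] output; not read from the corpus here.
sample = [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
          ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'),
          ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'),
          ('of', 'IN')]

counts = build_word_tag_dict(sample)
print(counts['fulton']['NP'])
print(counts['county'].most_common(1))
```

For the exercise itself you would presumably pass brown.tagged_words() in place of sample. Because the inner values are Counters, most_common(1) gives the most frequent tag for a word directly.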

Here is the definition of freq_bigrams we used today in class.

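import nltk

def freq_bigrams(bigram_list):
    # Count each bigram case-insensitively.
    freqdist = nltk.FreqDist()
    for (w1, w2) in bigram_list:
        # freqdist.inc(...) was removed in NLTK 3; += 1 does the same job.
        freqdist[(w1.lower(), w2.lower())] += 1
    return freqdist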
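
As a usage sketch, a bigram list can be built by pairing each word with the word that follows it. The version below swaps collections.Counter in for nltk.FreqDist so that it runs without NLTK installed; count_bigrams and the toy word list are hypothetical and are not the class code.

```python
from collections import Counter

def count_bigrams(bigram_list):
    """Counter-based stand-in for the bigram counter, for illustration only."""
    counts = Counter()
    for (w1, w2) in bigram_list:
        # Lowercase both words so ('The', 'jury') and ('the', 'jury') match.
        counts[(w1.lower(), w2.lower())] += 1
    return counts

# Toy word list, made up for this example.
words = ['The', 'jury', 'said', 'the', 'jury', 'met']

# Pair each word with its successor to form the bigrams.
bigrams = list(zip(words, words[1:]))

counts = count_bigrams(bigrams)
print(counts[('the', 'jury')])
```

Lowercasing before counting is what lets the sentence-initial 'The jury' and the later 'the jury' fall under the same key.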