Maximum Entropy modeling assignment

  1. For this assignment you will use the NLTK senseval data to do sense disambiguation for the word hard. Below, for generality, I will use word to designate the word whose tokens we are doing sense disambiguation on. But for this assignment in particular, the only value word will have is hard.
  2. To check for access, start up python and proceed as follows:
    [gawron@ngram ~]$ python
    Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48) 
    [GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from nltk.corpus import senseval
    >>> X = senseval.instances('hard.pos')
    >>> len(X)
    4333
    >>> X[0]
    SensevalInstance(word='hard-a', position=20,
                     context=[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'),
                              ('popular', 'JJ'), ('support', 'NN'), (',', ','), ('but', 'CC'),
                              ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'), ('kill', 'VB'),
                              ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'),
                              ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'),
                              ('do', 'VB'), ('.', '.'), ("''", "''")],
                     senses=('HARD1',))
    
    Note that X is a list of Senseval Instances and that a Senseval instance has a context attribute containing a part-of-speech tagged example sentence from the corpus, a senses attribute containing the answer to our classification problem for that context (the sense 'HARD1' in this case), and a position attribute telling us where in the context a form of word occurs.
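
    The three attributes can be read directly off an instance. The sketch below uses a hypothetical stand-in, Inst, a namedtuple with the same attribute names as a Senseval instance, so that it runs without the corpus installed. With NLTK available, X[0] from the session above could be used in its place. The shortened context is made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the attributes of a Senseval instance.
Inst = namedtuple('Inst', ['word', 'position', 'context', 'senses'])

inst = Inst(word='hard-a', position=2,
            context=[('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'),
                     ('to', 'TO'), ('do', 'VB'), ('.', '.')],
            senses=('HARD1',))

# The token at `position` is the occurrence of the target word.
target_word, target_tag = inst.context[inst.position]

# The gold label is the first (here, the only) entry in `senses`.
gold_sense = inst.senses[0]

# Just the words of the context, without their tags.
words = [w for (w, t) in inst.context]
```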
  3. I have tried this on both bulba and ngram and it works for me.

    If you have a problem, make sure the following environment variable is set in Unix (not Python):

      NLTK_DATA=/opt/lib/nltk/data
    This can be checked by typing 'echo $NLTK_DATA' to a shell:
    [gawron@ngram ~]$ echo $NLTK_DATA
    
    The shell should reply:
    /opt/lib/nltk/data
    
    If the variable is not set, enter the following line in your .bash_profile file ($HOME/.bash_profile):
    export NLTK_DATA=/opt/lib/nltk/data
    
  4. Your work will require the nltk.classify.maxent module. To test access, proceed as follows:
    [gawron@ngram ~]$ python
    Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48) 
    [GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import nltk.classify.maxent
    >>> nltk.classify.maxent.demo()
    Training classifier...
    
    .... [lots of stuff!]....
    Optimization terminated successfully.
             Current function value: 0.404530
             Iterations: 37
             Function evaluations: 77
             Gradient evaluations: 77
    Testing classifier...
    Accuracy: 0.7940
    Avg. log likelihood: -0.5958
    
    Unseen Names      P(Male)  P(Female)
    ----------------------------------------
      Octavius        *0.9756   0.0244
      Thomasina        0.0291  *0.9709
      Barnett         *0.6795   0.3205
      Angelina         0.0029  *0.9971
      Saunders        *0.8483   0.1517
    >>> 
    
    This runs the max ent "names" demo, a classifier which assigns gender to proper names. The function demo() defined in nltk.classify.maxent is, as shown below, just a wrapper for the real names_demo function defined in nltk.classify.util.

    Note that the features used for name recognition are defined in the function nltk.classify.util.names_demo_features:

    
    def names_demo_features(name):
        features = {}
        features['alwayson'] = True
        features['startswith'] = name[0].lower()
        features['endswith'] = name[-1].lower()
        for letter in 'abcdefghijklmnopqrstuvwxyz':
            features['count(%s)' % letter] = name.lower().count(letter)
            features['has(%s)' % letter] = letter in name.lower()
        return features
    
    This illustrates the format for a feature function. It builds and returns a dictionary whose keys are feature names and whose values are the feature values of some context. In this application, a context is a name string and the features are:
    1. a count feature for each letter of the alphabet, e.g., 'count(h)', which returns the number of 'h's occurring in the name.
    2. an occurrence feature for each letter of the alphabet, e.g., 'has(h)', which returns True if 'h' occurs in the name, and False if not.
    3. 'startswith' and 'endswith' features which return the letter that the name starts/ends with.
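
    To see what the classifier actually receives, the function can be called on one of the demo names. The definition is repeated here only so that the example runs on its own. The name 'Octavius' is taken from the demo output above.

```python
def names_demo_features(name):
    # Same definition as in the text, repeated so the example is self-contained.
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features

feats = names_demo_features('Octavius')

# A few of the entries in the resulting dictionary:
first_letter = feats['startswith']   # 'o'
last_letter = feats['endswith']      # 's'
a_count = feats['count(a)']          # 1
has_z = feats['has(z)']              # False
```

    Every name yields the same set of keys (3 fixed features plus 2 per letter), which is the property the word sense features should also have.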
  5. Your task is to get a word sense disambiguation demo working for the word 'hard' (the subcorpus of senseval is called 'hard.pos') using a max entropy classifier.
  6. Here is the demo function you want to get working:
    
    _inst_cache = {}
    1  def wsd_demo(trainer, word, features=wsd_demo_features, n=1000):
    2     """
    3     JMG: A good value for word is 'hard.pos',
    4     for which there are 4,333 instances in nltk.corpus.senseval.
    5 
    6     """
    7     from nltk.corpus import senseval
    8     import random
    9
    10    # Get the instances.
    11    print 'Reading data...'
    12    global _inst_cache
    13    if word not in _inst_cache:
    14        _inst_cache[word] = [(i, i.senses[0]) for i in senseval.instances(word)]
    15    events = _inst_cache[word][:]
    16    senses = list(set(l for (i,l) in events)); instances = [i for (i,l) in events]
    17    vocab = extract_vocab(instances,300)
    18    if n> len(events): n = len(events)
    19    print '  Senses: ' + ' '.join(senses)
    
    20    # Randomly split the instances into a test & train set.
    21    print 'Splitting into test & train...'
    22    random.seed(123456)
    23    random.shuffle(events)
    24    train = events[:int(.8*n)]
    25    test = events[int(.8*n):n]
    26
    27    # Train up a classifier.
    28    print 'Training classifier...'
    29    classifier = trainer( [(features(i,vocab), l) for (i,l) in train] )
    30
    31    # Run the classifier on the test data.
    32    print 'Testing classifier...'
    33    acc = accuracy(classifier, [(features(i,vocab),l) for (i,l) in test])
    34    print 'Accuracy: %6.4f' % acc
    35
    36    # For classifiers that can find probabilities, show the log
    37    # likelihood and some sample probability distributions.
    38    try:
    39        test_featuresets = [features(i,vocab) for (i,l) in test]
    40        pdists = classifier.batch_prob_classify(test_featuresets)
    41        ll = [pdist.logprob(gold)
    42              for ((name, gold), pdist) in zip(test, pdists)]
    43        print 'Avg. log likelihood: %6.4f' % (sum(ll)/len(test))
    44    except NotImplementedError:
    45        pass
    46    
    47    # Return the classifier
    48    return classifier
    
    This function calls two functions not currently defined. They are extract_vocab (called on line 17) and wsd_demo_features (called in lines 29, 33, and 39, because the parameter features will by default be bound to the function object wsd_demo_features).
    1. extract_vocab(senselist, n):
      1. senselist: a list of word sense instances (the kinds of objects senseval.instances returns in line 14).
      2. n: an integer which specifies the number of vocabulary items to collect. Line 17 in wsd_demo is written so as to collect 300. The function extract_vocab should return a vocabulary of the n most frequently occurring words in the contexts in senselist. It is a good idea to exempt some of THE most frequently occurring words (called stopwords) and not use them as features. Here is a modified version of Google's list of stopwords (which are never used in document search queries):
            # Google's stoplist with most preps removed. "and" added
            stopwords = [ 'I',    'a',    'an',    'are',    'as',    'and',
                          'be',    'com',   'how',  'is',    'it',    'of',    'or',
                          'that',    'the',  'this',    'to',    'was',    'what',
                          'when',   'where',    'who',    'will',    'with',    
                          'the',    'www']
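
        One possible shape for extract_vocab is sketched below. It is a starting point under stated assumptions, not the required solution: it assumes every context entry is a (word, tag) pair, it case-folds words before counting (so the stoplist is lowercased too), and it counts punctuation tokens like any other word. The Inst namedtuple and the toy data are hypothetical stand-ins so the sketch can be tried without the corpus.

```python
from collections import Counter, namedtuple

# Stoplist from the assignment, lowercased so it matches case-folded tokens.
STOPWORDS = set(w.lower() for w in [
    'I', 'a', 'an', 'are', 'as', 'and', 'be', 'com', 'how', 'is', 'it',
    'of', 'or', 'that', 'the', 'this', 'to', 'was', 'what', 'when',
    'where', 'who', 'will', 'with', 'www'])

def extract_vocab(senselist, n):
    """Return the n most frequent non-stopword words in the contexts."""
    counts = Counter()
    for inst in senselist:
        # Assumption: every context entry is a (word, tag) pair.
        for (w, t) in inst.context:
            w = w.lower()  # assumption: 'Hard' and 'hard' are one vocab item
            if w not in STOPWORDS:
                counts[w] += 1
    return [w for (w, c) in counts.most_common(n)]

# Hypothetical stand-in for a Senseval instance, for a quick check without NLTK.
Inst = namedtuple('Inst', ['context'])
toy = [Inst([('the', 'DT'), ('hard', 'JJ'), ('work', 'NN')]),
       Inst([('hard', 'JJ'), ('work', 'NN'), ('pays', 'VBZ')]),
       Inst([('Hard', 'JJ'), ('to', 'TO'), ('say', 'VB')])]
vocab = extract_vocab(toy, 2)
```

        If words are case-folded here, the feature function has to case-fold in the same way, or the vocabulary items will not match the context words.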
        
      3. wsd_demo_features(senseval_inst, vocab): This returns a dictionary of the features of senseval_inst. This feature dictionary is intended to be used in the max ent word sense demo and should follow the conventions below. You should experiment with new features for extra credit, but at a minimum implement the following:
        1. a feature named 'alwaystrue' that always returns True. Thus the first two lines of code in the definition of wsd_demo_features are:
          features = {}
          features['alwaystrue'] = True
          
          And the last line should be:
          return features
          
        2. For each of the 300 most frequently occurring vocab items, w, implement a feature such that features[w] = True if and only if w occurs as a word in senseval_inst.context (recall that the context is a list of (word, tag) pairs, so test against the words, not the pairs). Note the feature dictionary should always return values for each of the 300 vocab items you are using, whether or not w occurs, so make sure:
                features[w] = False
                
          when w does not occur in senseval_inst.context.
        3. A feature for the part of speech of word in the context (senseval_inst.context[senseval_inst.position][1]).
        4. A feature checking whether the part of speech of the FOLLOWING word is TO.
        Your assignment is to define both these functions and get the wsd_demo working.
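
        The four minimal features above could be put together roughly as follows. This is a sketch under assumptions, not the required solution: the feature names 'POS(target)' and 'POS(next)==TO' are made up for illustration, every context entry is assumed to be a (word, tag) pair, and words are lowercased on the assumption that the vocabulary was built from lowercased words. The Inst namedtuple is a hypothetical stand-in for a Senseval instance.

```python
from collections import namedtuple

def wsd_demo_features(senseval_inst, vocab):
    features = {}
    features['alwaystrue'] = True

    context = senseval_inst.context
    pos = senseval_inst.position

    # One boolean feature per vocab item, present whether or not w occurs.
    words = set(w.lower() for (w, t) in context)
    for w in vocab:
        features[w] = (w in words)

    # Part of speech of the target word itself (hypothetical feature name).
    features['POS(target)'] = context[pos][1]

    # Is the word after the target tagged TO?  False if the target is last.
    features['POS(next)==TO'] = (pos + 1 < len(context)
                                 and context[pos + 1][1] == 'TO')
    return features

# Hypothetical stand-in for a Senseval instance.
Inst = namedtuple('Inst', ['word', 'position', 'context', 'senses'])
inst = Inst('hard-a', 2,
            [('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'),
             ('to', 'TO'), ('do', 'VB')],
            ('HARD1',))
feats = wsd_demo_features(inst, ['do', 'work'])
```

        Here feats['do'] is True and feats['work'] is False, so both vocab items receive a value, as item 2 requires.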
      4. A simple way to implement this is just to edit a copy of nltk.classify.maxent. Add the above definition of wsd_demo, and define extract_vocab and wsd_demo_features. Then edit the code snippet at the end of the file:
        ######################################################################
        #{ Demo
        ######################################################################
        def demo():
            from nltk.classify.util import names_demo
            classifier = names_demo(MaxentClassifier.train)
        
        if __name__ == '__main__':
            demo()
        
        to be:
        ######################################################################          
        # Demo                                                                         
        ######################################################################          
        def demo():
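            # wsd_demo has no max_iter parameter, so bind it in the trainer.
            trainer = lambda toks: MaxentClassifier.train(toks, max_iter=50)
            classifier = wsd_demo(trainer, 'hard.pos', n=4300)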
            return classifier
        
        if __name__ == '__main__':
            classifier = demo()
        
        Loading the file (doing "Run" under idle) should now run the demo function. Note that I have set the number of 'hard.pos' instances to 4300 and the number of iterations to 50. You should experiment with increasing and decreasing the number of iterations and report the results.
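
        When reporting results, it may help to know how the two numbers printed by wsd_demo are computed. The sketch below reproduces them by hand on made-up data. The probability distributions are hypothetical classifier outputs written as plain dictionaries, and the log is taken base 2 on the assumption that logprob behaves as in NLTK's probability distribution classes.

```python
import math

# Hypothetical classifier output for three test instances: P(sense | context).
pdists = [
    {'HARD1': 0.5,  'HARD2': 0.25, 'HARD3': 0.25},
    {'HARD1': 0.25, 'HARD2': 0.5,  'HARD3': 0.25},
    {'HARD1': 0.5,  'HARD2': 0.25, 'HARD3': 0.25},
]
gold = ['HARD1', 'HARD2', 'HARD3']

# Accuracy: fraction of instances whose most probable sense is the gold sense.
predicted = [max(d, key=d.get) for d in pdists]
acc = sum(1 for (p, g) in zip(predicted, gold) if p == g) / float(len(gold))

# Average log likelihood: mean base-2 log of the probability given to the
# gold sense.  It is 0 for a perfectly confident, correct classifier and
# becomes more negative as the gold sense gets less probability.
ll = [math.log(d[g], 2) for (d, g) in zip(pdists, gold)]
avg_ll = sum(ll) / len(ll)
```

        Accuracy only looks at the top-ranked sense, while the average log likelihood also rewards putting more probability on the correct sense, so the two can move differently as the number of iterations changes.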