Maximum Entropy modeling assignment

  1. For this assignment you need a Python package that is NOT part of the standard Python distribution. It is called nltk. A Google search on the string "nltk" will direct you to the nltk home page, or you can go to:
      NLTK home page
    The home page will tell you NOT to use nltk with Python 2.6 or later (in particular, the terrifying Python 3.0 that is now available). So if at all possible you should stick to Python 2.5.X. Among the optional Python packages needed for some portions of nltk, you will need numpy. You will also need to install the data for nltk after installing the software. So your install needs are:
    1. NLTK software: Follow platform specific directions on NLTK download page
    2. Numpy: Follow platform specific directions on NLTK download page
    3. NLTK data: Follow the directions on the NLTK data page. Note: Installing the data is definitely optional for this assignment. You can get all the data you need from the XML corpus below.
  2. For this assignment you will use the NLTK senseval data to do sense disambiguation for the word hard. Below, for generality, I will use word to designate the word whose tokens we are doing sense disambiguation on. But for this assignment in particular, the only value word will have is hard.
  3. To check for nltk access, start up Python and proceed as follows:
    [gawron@ngram ~]$ python
    Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48) 
    [GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import nltk.classify.maxent
    >>> nltk.classify.maxent.demo()
    Training classifier...
    
          Iteration    Log Likelihood    Accuracy
          ---------------------------------------
                 1          -0.69315        0.374
                 2          -0.61426        0.626
           .... [lots more iterations!]....
    Optimization terminated successfully.
             Current function value: 0.404530
             Iterations: 37
             Function evaluations: 77
             Gradient evaluations: 77
    Testing classifier...
    Accuracy: 0.7940
    Avg. log likelihood: -0.5958
    
    Unseen Names      P(Male)  P(Female)
    ----------------------------------------
      Octavius        *0.9756   0.0244
      Thomasina        0.0291  *0.9709
      Barnett         *0.6795   0.3205
      Angelina         0.0029  *0.9971
      Saunders        *0.8483   0.1517
    >>> 
    
    This demos the max ent "names" model, a classifier which assigns gender to proper names. Although the demo is run by a function named demo() defined in nltk.classify.maxent, that function is just a wrapper for the real names_demo function defined in nltk.classify.util.

    Note that the features used for name classification are defined in the function nltk.classify.util.names_demo_features:

    
    def names_demo_features(name):
        features = {}
        features['alwayson'] = True
        features['startswith'] = name[0].lower()
        features['endswith'] = name[-1].lower()
        for letter in 'abcdefghijklmnopqrstuvwxyz':
            features['count(%s)' % letter] = name.lower().count(letter)
            features['has(%s)' % letter] = letter in name.lower()
        return features
    
    This illustrates the format for a feature function. It builds and returns a dictionary whose keys are feature names and whose values are the feature values of some context. In this application, a context is a name string and the features are:
    1. a count feature for each letter of the alphabet, e.g., 'count(h)', which returns the number of 'h's occurring in the name.
    2. an occurrence feature for each letter of the alphabet, e.g., 'has(h)', which returns True if 'h' occurs in the name, and False if not.
    3. 'startswith' and 'endswith' features which return the letter that the name starts/ends with.
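
To see what such a feature dictionary looks like in practice, here is a small sketch that repeats the definition above (so it runs without nltk) and inspects a few entries. The name 'Hannah' is just an illustrative input, not something taken from the demo.

```python
def names_demo_features(name):
    # Same definition as shown above, copied so this example is self-contained.
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features

feats = names_demo_features('Hannah')
print(feats['startswith'])   # h
print(feats['count(h)'])     # 2
print(feats['has(z)'])       # False
print(len(feats))            # 55: 3 fixed features plus 2 for each of 26 letters
```

Note that every name yields the same 55 keys; only the values differ from one context to the next.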
  4. I have tried importing maxent on both bulba and ngram and it works for me on both machines.

    If you have a problem, make sure the following environment variable is set in Unix on bulba/ngram (not Python):

      NLTK_DATA=/opt/lib/nltk/data
    This can be checked by typing 'echo $NLTK_DATA' in a shell:
    [gawron@ngram ~]$ echo $NLTK_DATA
    
    The shell should reply:
    /opt/lib/nltk/data
    
    If the variable is not set, enter the following line in your .bash_profile file ($HOME/.bash_profile):
    export NLTK_DATA=/opt/lib/nltk/data
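
The same check can be made from inside Python. This is only a sketch: nltk_data_is_set is a hypothetical helper written for this handout, not part of nltk, and it simply compares the variable against the path given above.

```python
import os

def nltk_data_is_set(expected='/opt/lib/nltk/data', environ=None):
    # Hypothetical helper (not an nltk function): True if NLTK_DATA
    # in the given environment equals the expected path.
    if environ is None:
        environ = os.environ
    return environ.get('NLTK_DATA') == expected

if nltk_data_is_set():
    print('NLTK_DATA is set as expected')
else:
    print('NLTK_DATA is %r' % os.environ.get('NLTK_DATA'))
```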
    
  5. Your task is to get a word sense disambiguation demo working for the word 'hard' (the subcorpus of senseval is called 'hard.pos') using a max entropy classifier.
  6. The corpus is provided in XML format in the file senseval-hard.xml.
  7. There are 3 senses for hard in this corpus. The first thing you should do is get a sense for how hard the disambiguation task is by computing a "baseline" score. This is the accuracy score earned by a word sense disambiguator that always guesses the most probable sense of hard.
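
A minimal sketch of the baseline computation, assuming you have already collected the sense label of every instance into a list. The function name baseline_accuracy and the toy labels are illustrative only; the toy list does NOT reflect the real distribution of senses in the corpus.

```python
def baseline_accuracy(senses):
    """Return (most frequent sense, accuracy of always guessing that sense)."""
    counts = {}
    for s in senses:
        counts[s] = counts.get(s, 0) + 1
    best = max(counts, key=counts.get)
    return best, float(counts[best]) / len(senses)

# Toy labels for illustration only; replace with the labels from the corpus.
toy_senses = ['HARD1', 'HARD1', 'HARD1', 'HARD2', 'HARD3']
best, acc = baseline_accuracy(toy_senses)
print(best)
print(acc)
```

Any classifier you train should beat this number; if it does not, the features are not helping.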
  8. The first part of your task is to extract some features from the corpus and to put them in a file in a standard format described below. The second part is to test your model using the nltk maxent module.
  9. Here are some lines of the xml file corresponding to one event:
    <senseval_instance word="hard-a" sense="HARD1" position="20">
    ``_`` he_PRP may_MD lose_VB all_DT popular_JJ support_NN ,_, but_CC 
    someone_NN has_VBZ to_TO kill_VB him_PRP to_TO defeat_VB him_PRP 
    and_CC that_DT 's_VBZ hard_JJ to_TO do_VB ._. ''_''
    </senseval_instance>
    
    This is a tagged sentence with a token of the word hard in position 20 used in the sense "HARD1". There are 3 senses in the corpus which roughly seem to be:

    1. "HARD1": difficult to do
    2. "HARD2": potent (as in "the hard stuff")
    3. "HARD3": physically resistant to denting, bending, or scratching
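
The tagged sentence above can be broken into (word, tag) pairs with plain string operations. The sketch below assumes every token has the form word_TAG; split_tagged is a hypothetical helper name. It also confirms that position 20 counts from zero.

```python
def split_tagged(text):
    """Turn 'he_PRP may_MD' into [('he', 'PRP'), ('may', 'MD')]."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore only, so each token yields one pair.
        word, tag = token.rsplit('_', 1)
        pairs.append((word, tag))
    return pairs

sentence = ("``_`` he_PRP may_MD lose_VB all_DT popular_JJ support_NN ,_, but_CC "
            "someone_NN has_VBZ to_TO kill_VB him_PRP to_TO defeat_VB him_PRP "
            "and_CC that_DT 's_VBZ hard_JJ to_TO do_VB ._. ''_''")
pairs = split_tagged(sentence)
print(pairs[20])   # ('hard', 'JJ')
print(pairs[21])   # ('to', 'TO')
```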

    Here is a link to the "event" you want to extract from this sentence, in the standard format the nltk code is expecting: extracted event. Each event begins with "BEGIN EVENT" on a single line and ends with "END EVENT" on a single line. The second line in each event is the word sense occurring with that token of hard. All the lines that follow, up to the "END EVENT", are the context features of that sentence. You will notice that many of the features are just a word followed by 1 or 0. These features tell you whether that word occurred in this sentence. You will also notice that many of the words in the sentence have been left out. This is because, to save space, only the 300 most common words have features.
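
The event layout just described can be produced with a few lines of code. This is a sketch under one assumption: that each feature line is the feature name, a space, and the value. Check the linked sample event for the exact layout. format_event is a hypothetical helper, and the feature values below are made up for illustration.

```python
def format_event(sense, features):
    """Return one event as text.

    ASSUMPTION: each feature line is 'name value'; verify against the sample.
    """
    lines = ['BEGIN EVENT', sense]
    for name in sorted(features):
        lines.append('%s %s' % (name, features[name]))
    lines.append('END EVENT')
    return '\n'.join(lines) + '\n'

# Illustrative features only, not taken from the real sample event.
event_text = format_event('HARD1', {'support': 1, 'do': 1, 'rock': 0})
print(event_text)
```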
  10. The file extract_event.py, which is here, contains some sample Python wrapper code for extracting information from the XML file senseval-hard.xml and storing it in another file, senseval-hard.evt, in the above format. It calls two functions that are not implemented, extract_vocab and wsd_features:
    1. extract_vocab(event_list, n) with two arguments:
      1. event_list: a list of the events extracted from the xml corpus. It's up to you to decide what's in an event, but minimally it has to contain all the information you're going to output to the extracted event file. And for the purposes of extract_vocab, collecting the 300 most common vocab items, an event needs to contain the words in the sentence.
      2. n, an integer which specifies the number of vocabulary items to collect. The function extract_vocab should return a vocabulary of the n most frequently occurring words in the contexts in event_list. It is a good idea to exempt some of THE most frequently occurring words (called stopwords) and not use them as features. Here is a modified version of Google's list of stopwords (which are never used in document search queries):
            # Google's stoplist with most preps removed. "and" added
            stopwords = [ 'I',    'a',    'an',    'are',    'as',    'and',
                          'be',    'com',   'how',  'is',    'it',    'of',    'or',
                          'that',    'the',  'this',    'to',    'was',    'what',
                          'when',   'where',    'who',    'will',    'with',
                          'www']
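
One possible shape for extract_vocab, offered as a sketch and not as the expected solution. It makes several assumptions that are NOT in the wrapper code: each event has a context attribute holding (word, tag) pairs, words are lowercased before counting, and the stoplist is passed as an optional third argument. The Event class is a hypothetical stand-in for whatever you choose to store per event.

```python
class Event(object):
    # Hypothetical stand-in for one extracted event.
    def __init__(self, sense, position, context):
        self.sense = sense
        self.position = position
        self.context = context   # list of (word, tag) pairs

def extract_vocab(event_list, n, stopwords=()):
    """Return the n most frequent non-stopword words across all events."""
    stop = set(w.lower() for w in stopwords)
    counts = {}
    for event in event_list:
        for word, tag in event.context:
            w = word.lower()
            if w not in stop:
                counts[w] = counts.get(w, 0) + 1
    # Highest count first; ties broken alphabetically so output is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:n]

events = [
    Event('HARD1', 2, [('that', 'DT'), ('is', 'VBZ'), ('hard', 'JJ'),
                       ('to', 'TO'), ('do', 'VB')]),
    Event('HARD3', 1, [('a', 'DT'), ('hard', 'JJ'), ('rock', 'NN')]),
]
vocab = extract_vocab(events, 2, stopwords=['that', 'is', 'to', 'a'])
print(vocab)   # ['hard', 'do']
```

You may also want to decide whether punctuation tokens and the target word itself belong in the vocabulary; this sketch keeps both.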
        
    2. wsd_features(senseval_inst, vocab): This returns a dictionary of the features in the sentence, which should be everything you want to print to the extracted event file except the class. You should experiment with new features for extra credit but minimally, implement the following:
      1. a feature named 'alwaystrue' that always has the value 1. Thus the first two lines of code in the definition of wsd_features are:
        features = {}
        features['alwaystrue'] = 1
        
        And the last line should be:
        return features
        
      2. For each of the 300 most frequently occurring vocab items, w, implement a feature such that features[w] = 1 if and only if w occurs in senseval_inst.context. Note that the feature dictionary should always contain a value for each of the 300 vocab items you are using, whether or not w occurs, so make sure:
              features[w] = 0
              
        when w does not occur in senseval_inst.context.
      3. A feature checking whether the part of speech of the FOLLOWING word is TO.
      Your assignment is to define both these functions and convert the xml file into an extracted event file in the above format. Then test this with the nltk maxent module.
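
The three minimal features can be sketched as follows. Treat this as one possible reading and not as the required solution: it assumes senseval_inst.context is a list of (word, tag) pairs and senseval_inst.position is the zero-based index of the target word, it uses 0/1 values to match the event format, and the feature name 'next_pos_is_TO' is invented here. The Inst class is a hypothetical stand-in for a real instance.

```python
class Inst(object):
    # Hypothetical stand-in for one senseval instance.
    def __init__(self, position, context):
        self.position = position
        self.context = context   # list of (word, tag) pairs

def wsd_features(senseval_inst, vocab):
    features = {}
    features['alwaystrue'] = 1
    # Lowercased words of the sentence; vocab is assumed lowercased too.
    words = set(word.lower() for word, tag in senseval_inst.context)
    for w in vocab:
        features[w] = int(w in words)   # 1 if present, otherwise 0
    # Tag of the token right after the target word, if there is one.
    nxt = senseval_inst.position + 1
    has_next = nxt < len(senseval_inst.context)
    features['next_pos_is_TO'] = int(has_next and
                                     senseval_inst.context[nxt][1] == 'TO')
    return features

inst = Inst(2, [('that', 'DT'), ('is', 'VBZ'), ('hard', 'JJ'),
                ('to', 'TO'), ('do', 'VB')])
feats = wsd_features(inst, ['do', 'rock'])
print(sorted(feats.items()))
```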
    3. A simple way to call the nltk maxent module on a file in the extracted event format is just to use the following script: call_maxent.py. This runs a max ent demo on the extracted event file named by its first argument, using part of the file as training data and part of it as test data, and reports your scores. An optional second integer argument controls the number of iterations the GIS algorithm uses during training. The default is 50:
        call_maxent.py senseval-hard.evt 25
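
For orientation, here is a sketch of what reading an extracted event file and splitting it into training and test portions could look like. It is NOT the code in call_maxent.py: parse_events and split_train_test are hypothetical helpers, feature lines are assumed to be 'name value' with integer values, and the split fraction is an arbitrary choice. The (features, sense) pairs it builds are in the (featureset, label) form that NLTK classifiers are trained on.

```python
def parse_events(text):
    """Parse extracted-event text into a list of (features, sense) pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    events = []
    i = 0
    while i < len(lines):
        if lines[i] != 'BEGIN EVENT':
            raise ValueError('expected BEGIN EVENT, got %r' % lines[i])
        sense = lines[i + 1]
        i += 2
        features = {}
        while lines[i] != 'END EVENT':
            # ASSUMPTION: the value is the last whitespace-separated field.
            name, value = lines[i].rsplit(None, 1)
            features[name] = int(value)
            i += 1
        events.append((features, sense))
        i += 1   # step past END EVENT
    return events

def split_train_test(events, train_fraction=0.9):
    # The first part is used for training, the remainder for testing.
    cut = int(len(events) * train_fraction)
    return events[:cut], events[cut:]

sample = ("BEGIN EVENT\nHARD1\ndo 1\nrock 0\nEND EVENT\n"
          "BEGIN EVENT\nHARD3\ndo 0\nrock 1\nEND EVENT\n")
parsed = parse_events(sample)
train, test = split_train_test(parsed, train_fraction=0.5)
print(train)
print(test)
```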
    4. List of files
      1. the corpus (sense tagged XML file)
      2. sample extracted event
      3. extract_event.py (Python wrapper code for mapping from XML to extracted event file)
      4. call_maxent.py (Python script for calling max ent module)
    5. List of links
      1. NLTK home page
      2. NLTK download page
      3. NLTK data page.