[gawron@ngram ~]$ python
Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk.classify.maxent
>>> nltk.classify.maxent.demo()
Training classifier...
Iteration Log Likelihood Accuracy
---------------------------------------
1 -0.69315 0.374
2 -0.61426 0.626
.... [lots more iterations!]....
Optimization terminated successfully.
Current function value: 0.404530
Iterations: 37
Function evaluations: 77
Gradient evaluations: 77
Testing classifier...
Accuracy: 0.7940
Avg. log likelihood: -0.5958
Unseen Names P(Male) P(Female)
----------------------------------------
Octavius *0.9756 0.0244
Thomasina 0.0291 *0.9709
Barnett *0.6795 0.3205
Angelina 0.0029 *0.9971
Saunders *0.8483 0.1517
>>>
This demonstrates the maxent "names" model, a classifier that
assigns gender to proper names. Although the
demo is run by a function named demo()
defined in nltk.classify.maxent, that function is just a wrapper
for the real names_demo function
defined in nltk.classify.util.
Note that the features used to classify names are defined in the function nltk.classify.util.names_demo_features:
def names_demo_features(name):
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features
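This illustrates the format for a feature function. It builds
and returns a dictionary whose keys are feature names and whose values
are the feature values of some context. In this application,
a context is a name string and the features are an always-on feature,
the first letter of the name, the last letter of the name, and, for each
letter of the alphabet, how often it occurs in the name and whether it
occurs at all.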
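To see what such a dictionary looks like, the function can be called on a single name. The sketch below repeats names_demo_features so that it runs on its own, without NLTK installed; the choice of 'Angelina' (one of the unseen names in the demo output) is only for illustration.

```python
def names_demo_features(name):
    """Copy of the feature function shown above, repeated here so
    that this snippet is self-contained."""
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features

feats = names_demo_features('Angelina')
print(feats['startswith'])   # a
print(feats['endswith'])     # a
print(feats['count(n)'])     # 2
print(feats['has(z)'])       # False
print(len(feats))            # 55: 3 fixed features + 2 per letter
```

Every name yields the same 55 keys; only the values differ from one name to the next.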
If you have a problem, make sure the following environment variable is set in Unix on bulba/ngram (not Python):
[gawron@ngram ~]$ echo $NLTK_DATA
The shell should reply:
/opt/lib/nltk/data
If the variable is not set, enter the following line in your .bash_profile file ($HOME/.bash_profile):
export NLTK_DATA=/opt/lib/nltk/data
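The same check can also be made from inside Python, which may be convenient when the problem only shows up in a script. This is a minimal sketch using only the standard library; the helper name nltk_data_status is hypothetical, and the expected path is the one given above.

```python
import os

EXPECTED = '/opt/lib/nltk/data'   # path given above for bulba/ngram

def nltk_data_status(environ=os.environ):
    """Return a short message describing the NLTK_DATA setting.
    (Hypothetical helper, not part of NLTK.)"""
    value = environ.get('NLTK_DATA')
    if value is None:
        return 'NLTK_DATA is not set'
    if value != EXPECTED:
        return 'NLTK_DATA is set to %s, expected %s' % (value, EXPECTED)
    return 'NLTK_DATA is set correctly'

print(nltk_data_status())
```

If the message says the variable is not set, fix it in the shell as described above and start a new login session before running Python again.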
〈senseval_instance word="hard-a" sense="HARD1" position="20"〉 ``_`` he_PRP may_MD lose_VB all_DT popular_JJ support_NN ,_, but_CC someone_NN has_VBZ to_TO kill_VB him_PRP to_TO defeat_VB him_PRP and_CC that_DT 's_VBZ hard_JJ to_TO do_VB ._. ''_'' 〈/senseval_instance〉
This is a tagged sentence with a token of the word hard in position 20 used in the sense "HARD1". There are 3 senses in the corpus which roughly seem to be:
# Google's stoplist with most prepositions removed; "and" added.
stopwords = ['I', 'a', 'an', 'are', 'as', 'and',
             'be', 'com', 'how', 'is', 'it', 'of', 'or',
             'that', 'the', 'this', 'to', 'was', 'what',
             'when', 'where', 'who', 'will', 'with',
             'www']
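One way such a list might be used is to drop stopwords from a context before building features. The sketch below is an assumption about usage, not part of the original text: it represents part of the tagged sentence above as a list of (word, tag) pairs, and content_words is a hypothetical helper name.

```python
stopwords = ['I', 'a', 'an', 'are', 'as', 'and',
             'be', 'com', 'how', 'is', 'it', 'of', 'or',
             'that', 'the', 'this', 'to', 'was', 'what',
             'when', 'where', 'who', 'will', 'with',
             'www']

# Lowercase once so that the comparison ignores case.
stopset = set(s.lower() for s in stopwords)

def content_words(context):
    """Return the words of a (word, tag) context that are not stopwords.
    (Hypothetical helper for illustration.)"""
    return [word for (word, tag) in context if word.lower() not in stopset]

# A stretch of the tagged sentence shown above.
context = [('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'),
           ('kill', 'VB'), ('him', 'PRP'), ('and', 'CC'),
           ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'),
           ('to', 'TO'), ('do', 'VB')]

print(content_words(context))
```

Note that this filter looks only at the word, not the tag, and that punctuation tokens would pass through unless they were added to the list.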
features = {}
features['alwaystrue'] = 1
And the last line should be:
return features
features[w] = False
when w does not occur in senseval_inst.context.
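Putting these pieces together, a feature function for a Senseval instance might look like the following sketch. It assumes that senseval_inst.context is a list of (word, tag) pairs, as in the tagged sentence shown earlier, and that features[w] is True when w does occur in the context. The names senseval_features and vocabulary, and the FakeInstance stand-in used to run the code without the corpus, are hypothetical.

```python
class FakeInstance(object):
    """Hypothetical stand-in for a Senseval instance; only the
    .context attribute is used in this sketch."""
    def __init__(self, context):
        self.context = context

def senseval_features(senseval_inst, vocabulary):
    features = {}
    features['alwaystrue'] = 1
    # Assumption: context is a list of (word, tag) pairs.
    words_in_context = set(word for (word, tag) in senseval_inst.context)
    for w in vocabulary:
        # True when w occurs in the context, False when it does not.
        features[w] = w in words_in_context
    return features

inst = FakeInstance([('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'),
                     ('to', 'TO'), ('do', 'VB')])
feats = senseval_features(inst, ['kill', 'do', 'rock'])
print(feats['do'])      # True
print(feats['rock'])    # False
```

Looping over a fixed vocabulary, rather than over the words of the context, is what guarantees that every instance gets a value for every feature, including the False ones.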