[gawron@ngram ~]$ python
Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import senseval
>>> X = senseval.instances('hard.pos')
>>> len(X)
4333
>>> X[0]
SensevalInstance(word='hard-a', position=20,
context=[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'),
('popular', 'JJ'), ('support', 'NN'), (',', ','), ('but', 'CC'),
('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'), ('kill', 'VB'),
('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'),
('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'),
('do', 'VB'), ('.', '.'), ("''", "''")],
senses=('HARD1',))
Note that X is a list of SensevalInstance objects. Each instance
has a context attribute containing a part-of-speech tagged
example sentence from the corpus, a senses attribute containing
the answer to our classification problem for that context (the
sense 'HARD1' in this case), and a position attribute
telling us where in the context a form of the target word occurs.
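The relationship between the three attributes can be checked without loading the corpus. The following is a minimal sketch that uses a hypothetical stand-in class, FakeInstance, carrying the same attribute names and the values printed above; with NLTK installed, the same expressions should apply to X[0].

```python
from collections import namedtuple

# Hypothetical stand-in for a SensevalInstance. Only the attribute names
# and values are taken from the session above; the class itself is not NLTK's.
FakeInstance = namedtuple('FakeInstance', 'word position context senses')

inst = FakeInstance(
    word='hard-a',
    position=20,
    context=[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'),
             ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'), (',', ','),
             ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'),
             ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'),
             ('him', 'PRP'), ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'),
             ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'), ('.', '.'),
             ("''", "''")],
    senses=('HARD1',))

# position is an index into context, so it picks out the target token.
target_word, target_tag = inst.context[inst.position]
print('%s/%s' % (target_word, target_tag))

# The label used for classification is the first (here, only) sense.
label = inst.senses[0]
print(label)
```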
If you have a problem, make sure the following environment variable is set in Unix (not Python):
[gawron@ngram ~]$ echo $NLTK_DATA
The shell should reply:
/opt/lib/nltk/data
If the variable is not set, enter the following line in your .bash_profile file ($HOME/.bash_profile):
export NLTK_DATA=/opt/lib/nltk/data
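If the shell shows the variable but the corpus still fails to load, it can help to confirm that the Python process sees the same value. This is a minimal sketch using only the standard library; nltk_data_dir is a hypothetical helper name, not part of NLTK.

```python
import os

def nltk_data_dir(environ=os.environ):
    """Return the NLTK_DATA setting, or None if the variable is unset."""
    return environ.get('NLTK_DATA')

path = nltk_data_dir()
if path is None:
    print('NLTK_DATA is not set in this process')
else:
    # Also report whether the directory actually exists on this machine.
    print('NLTK_DATA = %s (directory exists: %s)' % (path, os.path.isdir(path)))
```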
[gawron@ngram ~]$ python
Python 2.5.1 (r251:54863, Jul 10 2008, 17:24:48)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk.classify.maxent
>>> nltk.classify.maxent.demo()
Training classifier...
.... [lots of stuff!]....
Optimization terminated successfully.
Current function value: 0.404530
Iterations: 37
Function evaluations: 77
Gradient evaluations: 77
Testing classifier...
Accuracy: 0.7940
Avg. log likelihood: -0.5958
Unseen Names      P(Male)  P(Female)
----------------------------------------
  Octavius        *0.9756   0.0244
  Thomasina        0.0291  *0.9709
  Barnett         *0.6795   0.3205
  Angelina         0.0029  *0.9971
  Saunders        *0.8483   0.1517
>>>
This demos the maxent "names" model, a classifier that
assigns gender to proper names. Although the
demo is run by a function named demo()
defined in nltk.classify.maxent, that function, as shown below, is just a wrapper
for the real names_demo function
defined in nltk.classify.util.
Note that the features used for name recognition are defined in the function nltk.classify.util.names_demo_features:
def names_demo_features(name):
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features
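This illustrates the format for a feature function. It builds
and returns a dictionary whose keys are feature names and whose values
are the feature values of some context. In this application,
a context is a name string and the features are: an always-on feature,
the first and last letters of the name, and, for each letter of the
alphabet, how many times it occurs in the name and whether it occurs at all.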
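To make the dictionary format concrete, the sketch below repeats names_demo_features from the listing above (with indentation restored) and applies it to one of the names from the demo output. The choice of name and the features printed are only for illustration.

```python
def names_demo_features(name):
    # Same body as the listing above.
    features = {}
    features['alwayson'] = True
    features['startswith'] = name[0].lower()
    features['endswith'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = letter in name.lower()
    return features

feats = names_demo_features('Octavius')

# Keys are feature names; values are the feature values for this name.
for key in ('alwayson', 'startswith', 'endswith', 'count(a)', 'has(z)'):
    print('%-12s %r' % (key, feats[key]))

# 3 fixed features plus 2 features for each of the 26 letters.
print(len(feats))
```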
    _inst_cache = {}
 1  def wsd_demo(trainer, word, features=wsd_demo_features, n=1000):
 2      """
 3      JMG: A good value for word is 'hard.pos',
 4      for which there are 4,333 instances in nltk.corpus.senseval.
 5
 6      """
 7      from nltk.corpus import senseval
 8      import random
 9
10      # Get the instances.
11      print 'Reading data...'
12      global _inst_cache
13      if word not in _inst_cache:
14          _inst_cache[word] = [(i, i.senses[0]) for i in senseval.instances(word)]
15      events = _inst_cache[word][:]
16      senses = list(set(l for (i,l) in events)); instances = [i for (i,l) in events]
17      vocab = extract_vocab(senses,300)
18      if n> len(events): n = len(events)
19      print ' Senses: ' + ' '.join(senses)
20      # Randomly split the names into a test & train set.
21      print 'Splitting into test & train...'
22      random.seed(123456)
23      random.shuffle(events)
24      train = events[:int(.8*n)]
25      test = events[int(.8*n):n]
26
27      # Train up a classifier.
28      print 'Training classifier...'
29      classifier = trainer( [(features(i,vocab), l) for (i,l) in train] )
30
31      # Run the classifier on the test data.
32      print 'Testing classifier...'
33      acc = accuracy(classifier, [(features(i,vocab),l) for (i,l) in test])
34      print 'Accuracy: %6.4f' % acc
35
36      # For classifiers that can find probabilities, show the log
37      # likelihood and some sample probability distributions.
38      try:
39          test_featuresets = [features(i,vocab) for (i,n) in test]
40          pdists = classifier.batch_prob_classify(test_featuresets)
41          ll = [pdist.logprob(gold)
42                for ((name, gold), pdist) in zip(test, pdists)]
43          print 'Avg. log likelihood: %6.4f' % (sum(ll)/len(test))
44      except NotImplementedError:
45          pass
46
47      # Return the classifier
48      return classifier
This function calls two functions not currently defined.
They are extract_vocab (called on line 17)
and wsd_demo_features (called in lines 29, 33, and 39, because the parameter
features will by default be bound to the function object wsd_demo_features).
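The point about the features parameter can be seen in isolation. The sketch below is not part of NLTK or of the demo; default_features, other_features, and run are hypothetical names used only to show a parameter whose default value is a function object.

```python
def default_features(item):
    # Hypothetical default feature function for this sketch.
    return {'length': len(item)}

def other_features(item):
    # Hypothetical alternative feature function.
    return {'first': item[0]}

def run(item, features=default_features):
    # 'features' is an ordinary parameter whose value is a function object;
    # features(item) calls whichever function the parameter is bound to.
    return features(item)

print(run('hard'))                           # uses default_features
print(run('hard', features=other_features))  # caller overrides the default
```

Because a default value is evaluated when the def statement runs, the default function has to be defined above the function that uses it; otherwise Python raises a NameError when the file is loaded.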
# Google's stoplist with most preps removed. "and" added
stopwords = [ 'I', 'a', 'an', 'are', 'as', 'and',
              'be', 'com', 'how', 'is', 'it', 'of', 'or',
              'that', 'the', 'this', 'to', 'was', 'what',
              'when', 'where', 'who', 'will', 'with',
              'the', 'www']
features = {}
features['alwaystrue'] = True
And the last line should be:
return features
features[w] = False
when w does not occur in senseval_inst.context.
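One way to picture a feature function of this shape is sketched below. It is an illustration under stated assumptions, not the intended solution: sketch_vocab and sketch_features are hypothetical names, the stoplist is abbreviated, the vocabulary is assumed to be the n most frequent non-stopword forms, and every context item is assumed to be a (word, tag) pair. The functions take a context list directly rather than a Senseval instance so that the sketch runs without the corpus.

```python
# Abbreviated stoplist, for this sketch only.
STOP = set(['a', 'an', 'the', 'to', 'and', 'is', 'it', 'of'])

def sketch_vocab(contexts, n):
    """Return the n most frequent alphabetic non-stopword forms (assumed design)."""
    counts = {}
    for context in contexts:
        for (word, tag) in context:
            w = word.lower()
            if w.isalpha() and w not in STOP:
                counts[w] = counts.get(w, 0) + 1
    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:n]

def sketch_features(context, vocab):
    """One boolean feature per vocabulary word, plus the always-true feature."""
    features = {}
    features['alwaystrue'] = True
    present = set(word.lower() for (word, tag) in context)
    for w in vocab:
        features[w] = w in present   # False when w does not occur in the context
    return features

contexts = [
    [('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB')],
    [('a', 'DT'), ('hard', 'JJ'), ('rock', 'NN')],
]
vocab = sketch_vocab(contexts, 3)
feats = sketch_features(contexts[1], vocab)
print(vocab)
print(sorted(feats.items()))
```

Every featureset built this way has the same keys, one per vocabulary word, whichever words actually occur in the context.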
######################################################################
#{ Demo
######################################################################
def demo():
    from nltk.classify.util import names_demo
    classifier = names_demo(MaxentClassifier.train)
if __name__ == '__main__':
    demo()
to be:
######################################################################
# Demo
######################################################################
def demo():
    classifier = wsd_demo(MaxentClassifier.train,'hard.pos',max_iter=50,n=4300)
    return classifier
if __name__ == '__main__':
    classifier = demo()
Loading the file (choosing "Run" under IDLE) should now run the
demo function. Notice that I have set the
number of 'hard.pos' instances to 4300 and the number of
iterations to 50. You should experiment with increasing and
decreasing the number of iterations and report the results.
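When comparing runs, it helps to know how many instances end up in each set. The sketch below mirrors lines 18 and 22-25 of the wsd_demo listing on a list of dummy integers, so it runs without NLTK; split_sizes is a hypothetical helper name.

```python
import random

def split_sizes(total, n):
    """Return (train size, test size) for 'total' events and a requested n."""
    events = list(range(total))        # dummy stand-ins for (instance, sense) pairs
    if n > len(events):
        n = len(events)                # same cap as line 18 of wsd_demo
    random.seed(123456)
    random.shuffle(events)
    train = events[:int(.8 * n)]       # first 80% of the n events used
    test = events[int(.8 * n):n]       # remaining 20%
    return len(train), len(test)

print(split_sizes(4333, 4300))   # the setting used in demo()
print(split_sizes(4333, 1000))   # the default n of wsd_demo
```

With n=4300, 3440 instances are used for training and 860 for testing, and the remaining 33 of the 4,333 instances are not used. Since the seed is fixed, the same split is produced on every run, so accuracy differences between runs reflect the iteration setting rather than the split.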