Tagging
Task
|
 
|
The Penn Treebank Tagset
|
| Sentence | CLAWS C5 | Brown | Penn | ICE |
| she | PNP | PPS | PRP | PRON(pers,sing) |
| was | VBD | BEDZ | VBD | AUX(pass,past) |
| told | VVN | VBN | VBN | V(ditr,edp) |
| that | CJT | CS | IN | CONJUNC(subord) |
| the | AT0 | AT | DT | ART(def) |
| journey | NN1 | NN | NN | N(com,sing) |
| might | VM0 | MD | MD | AUX(modal,past) |
| kill | VVI | VB | VB | V(montr,infin) |
| her | PNP | PPO | PRP | PRON(pers,sing) |
| . | PUN | . | . | PUNC(per) |
Many distinctions are much finer-grained than the standard set of
linguistic categories, some of them arguably uselessly so.
Example: the Brown tagset and some of its descendants have distinct tags for
the different forms of the verbs BE and HAVE (HVD = had).
Still, tagsets are not as fine-grained as a unification grammar with features.
|
Uses of
Tagging
|
 
|
Well, why would you want to do this?
Tagged Corpora and corpus sites.
Question: How was the BNC tagged?
|
Difficulty
of Problem |
 
|
Ambiguity
The Null hypothesis tagger: Tag each word with its most
frequent tag. Gets about 90% right.
A major source of difficulty, and a major contributor to the
errors of all known taggers, is unknown words. How do you
tag an unknown word?
Default strategy: Look at what tag most unknown words get in
some development test data. Use that tag for
all unknown words.
What tag is that? Guess....
A possible augmentation for any tagger: A dictionary.
[Reduces but does not eliminate unknown words.]
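The null hypothesis tagger and the default strategy for unknown words can be sketched in a few lines. This is a minimal illustration, not any particular published tagger: the function names, the toy data, and the `fallback` tag used when the development data contains no unknown words are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def default_unknown_tag(lexicon, dev_sentences, fallback="NN"):
    """Pick the tag that unknown words most often carry in development data.

    `fallback` is an assumption, used only if no unknown word is seen.
    """
    unknown = Counter(
        tag
        for sentence in dev_sentences
        for word, tag in sentence
        if word not in lexicon
    )
    return unknown.most_common(1)[0][0] if unknown else fallback

def tag(words, lexicon, unknown_tag):
    """Tag known words from the lexicon and everything else with unknown_tag."""
    return [(w, lexicon.get(w, unknown_tag)) for w in words]

# Toy data in Brown-style tags (invented for illustration).
train = [
    [("the", "AT"), ("journey", "NN"), ("might", "MD"), ("kill", "VB"), ("her", "PPO")],
    [("she", "PPS"), ("was", "BEDZ"), ("told", "VBN")],
    [("she", "PPS"), ("told", "VBD"), ("her", "PPO")],
    [("he", "PPS"), ("told", "VBD"), ("her", "PPO")],
]
dev = [[("the", "AT"), ("trip", "NN"), ("might", "MD"), ("kill", "VB"),
        ("the", "AT"), ("dog", "NN")]]

lexicon = train_baseline(train)
unknown_tag = default_unknown_tag(lexicon, dev)
print(tag(["she", "told", "the", "cat"], lexicon, unknown_tag))
# [('she', 'PPS'), ('told', 'VBD'), ('the', 'AT'), ('cat', 'NN')]
```

Note that "told" comes out as VBD simply because that tag is more frequent in the toy training data; the baseline never looks at context.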
| Tagger | Error Rate | Note |
| Church HMM tagger | 1-5% | Depending on definition of "correct" |
| Garside et al. | 3-4% | Probabilistic plus idiom rules |
| DeRose | 3-4%, 5.6% | WSJ, other |
| Brill Initial | 7.9% | "Simple" algorithm |
| Brill | 5% | 71 "patches" (rules) |
Brill's tagger
|
|
Approaches
|
 
|
Two standard approaches:
- Rule-based
- Statistically based
Within rule-based we can distinguish two other types:
- Handwritten
- Machine-learned
|
Tagset
Variation
|
 
|
Some tagsets are harder than others.
|
| Tag Set | Basic Size | Total tags |
| Brown | 87 | 179 |
| Penn | 45 | |
| CLAWS1 | 132 | |
| CLAWS2 | 166 | |
| CLAWS c5 (BNC) | 62 | |
| London-Lund | 197 | |
Brill says: "There are 192 tags in the Brown corpus, 96 of which
occur more than 100 times."
BNC Tagset ("Claws", C5)
One potential task is to define a tagset that
maximizes utility for parsing.
|
Brill's
tagger
|
 
|
Properties
- Rule based
- Rules are automatically learned.
A rule space.
Possible rule forms:
- If a word is tagged a and it is
in context C, then change that tag to b.
- If a word is tagged a and it has
lexical property P, then change that tag to b.
- If a word is tagged a and a word in region R
has lexical property P, then change that tag to b.
Possible patch templates (rule templates):
Change tag a to tag b when:
- The preceding (following) word is tagged z.
- The word 2 after (before) is tagged z.
- One of the two following (preceding) words
is tagged z.
- One of the three following (preceding) words
is tagged z.
- The preceding word is tagged z and the following
word is tagged w.
- The preceding (following) word is tagged z and the
word 2 before (after) is tagged w.
- The current word is (is not) capitalized.
- The previous word is (is not) capitalized.
Training Algorithm (supervised):
- Initial training: For each word learn its most
frequent tag. [Training corpus, 90%] [Note: This
is the first place where lexically
specific info comes into play.]
- Patch acquisition [Development test corpus, 5%]:
- Collect a list of error triples in
the form [Taga, Tagb, Number]
- For each error triple, find the patch (an instantiation of one of
the patch templates) that gives the best net error gain, where
net error gain = errors removed - errors added,
and add that patch to the patch list.
- Runtime [Test corpus, 5%]:
- Tag the test corpus using the initial training tagger.
- Revise: for each word w and each patch
that changes tag a to tag b, apply the patch
only if word w occurs in the
training corpus with tag b. [Note: This
is the second place where lexically
specific info comes into play.]
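The patch acquisition step can be sketched as follows. This is a simplified illustration under stated assumptions, not Brill's implementation: it uses a single template ("the preceding word is tagged z"), treats the development corpus as one flat tag sequence (ignoring sentence boundaries), and picks the single best patch per round, re-tagging before the next round. The function names and toy data are invented for the example.

```python
from collections import Counter

def apply_patch(tags, patch):
    """Apply one PREV-TAG patch (from_tag, to_tag, prev_tag) to a tag sequence."""
    from_tag, to_tag, prev_tag = patch
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are tested against the tags as they were before this patch.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def net_gain(current, gold, patch):
    """Net error gain = errors removed - errors added."""
    patched = apply_patch(current, patch)
    removed = sum(c != g and p == g for c, p, g in zip(current, patched, gold))
    added = sum(c == g and p != g for c, p, g in zip(current, patched, gold))
    return removed - added

def learn_patches(current, gold, max_patches=10):
    """Greedily collect patches until none gives a positive net gain."""
    current = list(current)
    patches = []
    for _ in range(max_patches):
        # Error triples: (tag assigned, correct tag) -> number of occurrences.
        errors = Counter((c, g) for c, g in zip(current, gold) if c != g)
        candidates = sorted((a, b, z) for (a, b) in errors for z in set(current))
        if not candidates:
            break
        best = max(candidates, key=lambda p: net_gain(current, gold, p))
        if net_gain(current, gold, best) <= 0:
            break
        patches.append(best)
        current = apply_patch(current, best)
    return patches

# Toy development data: the initial tagger calls both verbs NN.
gold    = ["AT", "NN", "MD", "VB", "PPS", "MD", "VB", "PPO"]
initial = ["AT", "NN", "MD", "NN", "PPS", "MD", "NN", "PPO"]
print(learn_patches(initial, gold))
# [('NN', 'VB', 'MD')]
```

The learned patch corresponds to a rule of the form NN VB PREV-TAG MD. The competing candidate ("change NN to VB after AT") has a negative net gain here, because it would break the correctly tagged noun.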
Note: there are two simple but important
refinements of the initial training having to do with
the treatment of unknown words.
- Capitalized unknown words are tagged as proper names.
- For other unknown words, assign the tag most common for words
ending in the same 3 letters:
blahblahous
gets tagged as an adjective.
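The two unknown-word refinements can be sketched like this. It is an illustration only: the notes do not say whether suffix statistics are counted over tokens or types (tokens are assumed here), and the `fallback` tag for unseen suffixes, the function names, and the toy word list are all assumptions.

```python
from collections import Counter, defaultdict

def train_suffix_table(tagged_words, suffix_len=3):
    """Map each word-final suffix to its most common tag (counted over tokens)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        if len(word) >= suffix_len:
            counts[word[-suffix_len:].lower()][tag] += 1
    return {suffix: c.most_common(1)[0][0] for suffix, c in counts.items()}

def guess_unknown(word, suffix_table, suffix_len=3, proper_tag="NP", fallback="NN"):
    """Guess a tag for a word that was never seen in training."""
    # Refinement 1: capitalized unknown words are tagged as proper names.
    if word[:1].isupper():
        return proper_tag
    # Refinement 2: otherwise use the most common tag for the same ending.
    # `fallback` (an assumption) covers endings never seen in training.
    return suffix_table.get(word[-suffix_len:].lower(), fallback)

# Toy training words in Brown-style tags (invented for illustration).
table = train_suffix_table([
    ("famous", "JJ"), ("nervous", "JJ"), ("bonus", "NN"), ("walked", "VBD"),
])
print(guess_unknown("blahblahous", table))  # JJ
print(guess_unknown("Zork", table))         # NP
print(guess_unknown("xyzzy", table))        # NN
```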
Some sample rules found by Brill's algorithm:
- TO IN NEXT-TAG AT
- VBN VBD PREV-WORD-IS-CAP YES
- VBD VBN PREV-1-OR-2-OR-3-TAG HVD
- TO IN NEXT-WORD-IS-CAP YES
- NN VB PREV-TAG MD
- PPS PPO NEXT-TAG .
- VBN VBD PREV-TAG PPS
- NP NN CURRENT-WORD-IS-CAP NO
Key:
TO Infinitival to
AT Article
IN Preposition
VBD past tense Verb
VBN past participle Verb
NP Proper Noun
NN Common Noun
MD Modal
PPO Objective (Accusative) Personal Pronoun
PPS Subject (Nominative) Personal Pronoun
HVD Had
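To make the rule notation concrete, here is a small sketch that reads rules in the form "FROM TO TEMPLATE VALUE" and applies them in order to an initially tagged sentence. It is an assumption-laden illustration rather than Brill's code: each rule is tested against the tags as they stood before that rule was applied, only the templates appearing in the sample rules are handled, and the example sentence and its initial tags are invented.

```python
def condition_holds(words, tags, i, template, value):
    """Test one rule condition at position i."""
    if template == "PREV-TAG":
        return i >= 1 and tags[i - 1] == value
    if template == "NEXT-TAG":
        return i + 1 < len(tags) and tags[i + 1] == value
    if template == "PREV-1-OR-2-OR-3-TAG":
        return value in tags[max(0, i - 3):i]
    if template == "CURRENT-WORD-IS-CAP":
        return words[i][:1].isupper() == (value == "YES")
    if template == "PREV-WORD-IS-CAP":
        return i >= 1 and words[i - 1][:1].isupper() == (value == "YES")
    if template == "NEXT-WORD-IS-CAP":
        return i + 1 < len(words) and words[i + 1][:1].isupper() == (value == "YES")
    raise ValueError(f"unknown template: {template}")

def apply_rules(words, tags, rules):
    """Apply each rule in order; a rule sees the tags left by earlier rules."""
    tags = list(tags)
    for rule in rules:
        from_tag, to_tag, template, value = rule.split()
        before = list(tags)  # snapshot, so one rule does not feed itself
        for i in range(len(words)):
            if before[i] == from_tag and condition_holds(words, before, i, template, value):
                tags[i] = to_tag
    return tags

rules = [
    "TO IN NEXT-TAG AT",
    "VBN VBD PREV-WORD-IS-CAP YES",
    "VBD VBN PREV-1-OR-2-OR-3-TAG HVD",
    "TO IN NEXT-WORD-IS-CAP YES",
    "NN VB PREV-TAG MD",
    "PPS PPO NEXT-TAG .",
    "VBN VBD PREV-TAG PPS",
    "NP NN CURRENT-WORD-IS-CAP NO",
]
words   = ["he", "had", "walked", "and", "it", "might", "rain", "."]
initial = ["PPS", "HVD", "VBD", "CC", "PPS", "MD", "NN", "."]
print(apply_rules(words, initial, rules))
# ['PPS', 'HVD', 'VBN', 'CC', 'PPS', 'MD', 'VB', '.']
```

Two rules fire on this sentence: "walked" becomes a past participle because HVD occurs among the three preceding tags, and "rain" becomes a verb because it follows a modal.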
Summary:
- Competitive with statistical taggers.
- Portable: doesn't depend on any particular tagset/corpus
properties.
- Simple/low memory overhead.
|