Intro to Comp Ling

Tagging

    Tagging Task
      The Penn Treebank Tagset

      Sentence  CLAWS c5  Brown  Penn  ICE
      she       PNP       PPS    PRP   PRON(pers,sing)
      was       VBD       BEDZ   VBD   AUX(pass,past)
      told      VVN       VBN    VBN   V(ditr,edp)
      that      CJT       CS     IN    CONJUNC(subord)
      the       AT0       AT     DT    ART(def)
      journey   NN1       NN     NN    N(com,sing)
      might     VM0       MD     MD    AUX(modal,past)
      kill      VVI       VB     VB    V(montr,infin)
      her       PNP       PPO    PRP   PRON(pers,sing)
      .         PUN       .      .     PUNC(per)

    Many distinctions are much finer-grained than the standard set of linguistic categories, some arguably uselessly so. Example: the Brown tagset and some of its descendants have distinct tags for the different forms of the verbs BE and HAVE (HVD = had).

    Not as fine-grained as a unification grammar with features.

    Uses of Tagging

    Well, why would you want to do this?

    Tagged Corpora and corpus sites.

    Question: How was the BNC tagged?

    Difficulty of Problem

    Ambiguity

    The null-hypothesis (baseline) tagger: Tag each word with its most frequent tag. This gets about 90% of words right.
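    The baseline just described can be sketched in a few lines. This is only an illustration on made-up data: the function names, the toy corpus, and the "UNK" placeholder for unseen words are assumptions, not part of any particular tagger.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, lexicon, unknown_tag="UNK"):
    """Tag each word with its most frequent tag.

    unknown_tag is a placeholder (an assumption of this sketch) for words
    that never occurred in training.
    """
    return [(w, lexicon.get(w, unknown_tag)) for w in words]


# Toy training data (hypothetical, Penn-style tags).
toy_corpus = [
    [("the", "DT"), ("journey", "NN"), ("might", "MD"), ("kill", "VB"), ("her", "PRP")],
    [("her", "PRP$"), ("journey", "NN")],
    [("she", "PRP"), ("told", "VBD"), ("her", "PRP")],
]
lexicon = train_most_frequent_tag(toy_corpus)
print(tag_baseline(["she", "told", "her", "the", "journey"], lexicon))
```

    In the toy corpus "her" occurs twice as PRP and once as PRP$, so the baseline always tags it PRP, including in the cases where PRP$ would be correct. Errors of this kind are what the remaining 10% consists of.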

    A major source of difficulty, and a major contributor to the error of all known taggers, is unknown words. What tag do you give an unknown word?

    Default strategy: Look at what tag most unknown words get in some development test data. Use that tag for all unknown words.

    What tag is that? Guess....
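    A minimal sketch of the default strategy: count the tags of development-set words that never occurred in training and take the most frequent one. The function name and the toy data are hypothetical, and the toy result says nothing about what a real corpus would give.

```python
from collections import Counter


def most_common_unknown_tag(train_vocab, dev_sentences):
    """Return the tag most often carried by dev-set words unseen in training."""
    unknown_tags = Counter(
        tag
        for sentence in dev_sentences
        for word, tag in sentence
        if word not in train_vocab
    )
    if not unknown_tags:
        return None  # every dev word was seen in training
    return unknown_tags.most_common(1)[0][0]


# Toy data (hypothetical): three dev words are missing from the training vocabulary.
train_vocab = {"the", "dog", "barks"}
dev_sentences = [
    [("the", "AT"), ("wug", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("blick", "NN"), ("glorped", "VBD")],
]
default_tag = most_common_unknown_tag(train_vocab, dev_sentences)
print(default_tag)
```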

    A possible augmentation for any tagger: A dictionary. [Reduces but does not eliminate unknown words.]

    Tagger             Error Rate   Note
    Church HMM tagger  1-5%         Depending on definition of "correct"
    Garside et al.     3-4%         Probabilistic, plus idiom rules
    De Rose            3-4%, 5.6%   WSJ, other
    Brill Initial      7.9%         "Simple" algorithm
    Brill              5%           71 "patches" (rules)

    Brill's tagger

    Approaches  

    Two standard approaches:

    • Rule-based
    • Statistically based

    Within rule-based we can distinguish two other types:

    1. Handwritten
    2. Machine-learned

    Tagset Variation

    Some tagsets are harder than others.

      Tag Set         Basic Size  Total tags
      Brown           87          179
      Penn            45
      CLAWS1          132
      CLAWS2          166
      CLAWS c5 (BNC)  62
      London-Lund     197
    Brill says: "There are 192 tags in the Brown corpus, 96 of which occur more than 100 times."

    BNC Tagset ("Claws", C5)

    One potential task is to define a tagset that maximizes utility for parsing.

    Brill's tagger

    Properties

    • Rule based
    • Rules are automatically learned.

    A rule space.

    Possible rule forms:

    1. If a word is tagged a and it is in context C, then change that tag to b.
    2. If a word is tagged a and it has lexical property P, then change that tag to b.
    3. If a word is tagged a and a word in region R has lexical property P, then change that tag to b.

    Possible patch templates (rule templates):

      Change tag a to tag b when:
      1. The preceding (following) word is tagged z.
      2. The word 2 after (before) is tagged z.
      3. One of the two following (preceding) words is tagged z.
      4. One of the three following (preceding) words is tagged z.
      5. The preceding word is tagged z and the following word is tagged w.
      6. The preceding (following) word is tagged z and the word 2 before (after) is tagged w.
      7. The current word is (is not) capitalized.
      8. The previous word is (is not) capitalized.
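    One way to picture the templates above is as condition builders: filling in z (and a, b) turns a template into a concrete patch. The sketch below covers only a few of the templates; the class and function names are hypothetical. It also assumes that conditions are checked against the tags as they stood before the patch was applied, a choice the list above does not settle.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]  # [(word, tag), ...]


@dataclass(frozen=True)
class Patch:
    from_tag: str                              # tag a
    to_tag: str                                # tag b
    name: str                                  # human-readable description
    condition: Callable[[Tagged, int], bool]   # does the context hold at position i?


def prev_tag_is(z):
    """Template 1 (preceding variant): the preceding word is tagged z."""
    return lambda sent, i: i >= 1 and sent[i - 1][1] == z


def next_tag_is(z):
    """Template 1 (following variant): the following word is tagged z."""
    return lambda sent, i: i + 1 < len(sent) and sent[i + 1][1] == z


def one_of_prev_two_tagged(z):
    """Template 3 (preceding variant): one of the two preceding words is tagged z."""
    return lambda sent, i: any(tag == z for _, tag in sent[max(0, i - 2):i])


def current_word_capitalized(flag):
    """Template 7: the current word is (flag=True) or is not (flag=False) capitalized."""
    return lambda sent, i: sent[i][0][:1].isupper() == flag


def apply_patch(patch, sent):
    """Return a copy of sent with from_tag changed to to_tag wherever the condition holds.

    Conditions are evaluated on the input sentence, not on the partially
    rewritten output (an assumption of this sketch).
    """
    out = list(sent)
    for i, (word, tag) in enumerate(sent):
        if tag == patch.from_tag and patch.condition(sent, i):
            out[i] = (word, patch.to_tag)
    return out


example = [("they", "PPSS"), ("might", "MD"), ("kill", "NN")]
patch = Patch("NN", "VB", "PREV-TAG MD", prev_tag_is("MD"))
print(apply_patch(patch, example))
```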

    Training Algorithm (supervised):

    1. Initial training: For each word, learn its most frequent tag. [Training corpus, 90%] [Note: This is the first place where lexically specific info comes into play.]
    2. Patch acquisition [Development test corpus, 5%]:
      1. Tag the development test corpus with the initial tagger and collect a list of error triples of the form [Tag a, Tag b, Number], where Number counts how often a word was tagged a when it should have been tagged b.
      2. For each error triple, try each patch template and find the patch that gives the best net error gain, where
          net error gain = errors removed - errors added
        and add that patch to the patch list.
    3. Runtime [Test corpus, 5%]:
      1. Tag the test corpus using the initial training tagger.
      2. Apply each patch in the patch list: wherever its condition holds, change a word w's tag from a to b, provided that w occurs in the training corpus with tag b. [Note: This is the second place where lexically specific info comes into play.]
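    The patch acquisition step can be sketched as follows. To keep it short, the sketch searches a single template ("the preceding word is tagged z"), works on tag sequences only, and performs one round of the search; the function names and the toy data are hypothetical.

```python
from collections import Counter


def error_triples(current, gold):
    """Return [(tag_a, tag_b, number)]: tagged a but should be b, most frequent first."""
    errors = Counter()
    for cur_sent, gold_sent in zip(current, gold):
        for cur_tag, gold_tag in zip(cur_sent, gold_sent):
            if cur_tag != gold_tag:
                errors[(cur_tag, gold_tag)] += 1
    return [(a, b, n) for (a, b), n in errors.most_common()]


def net_gain_prev_tag(a, b, z, current, gold):
    """errors removed - errors added for: change a to b when the preceding tag is z."""
    removed = added = 0
    for cur_sent, gold_sent in zip(current, gold):
        for i in range(1, len(cur_sent)):
            if cur_sent[i] == a and cur_sent[i - 1] == z:
                if gold_sent[i] == b:
                    removed += 1   # the patch fixes this error
                elif gold_sent[i] == a:
                    added += 1     # the patch breaks a correct tag
                # otherwise the tag is wrong before and after: no change
    return removed - added


def best_prev_tag_patch(current, gold):
    """Return (gain, a, b, z) for the best instantiation of the PREV-TAG template."""
    tagset = sorted({t for sent in current for t in sent})
    best = None
    for a, b, _ in error_triples(current, gold):
        for z in tagset:
            gain = net_gain_prev_tag(a, b, z, current, gold)
            if best is None or gain > best[0]:
                best = (gain, a, b, z)
    return best


# Toy development data (hypothetical): output of the initial tagger vs. the gold tags.
current = [["AT", "NN", "MD", "NN"], ["PPS", "MD", "NN", "AT", "NN"], ["AT", "NN", "VBD"]]
gold = [["AT", "NN", "MD", "VB"], ["PPS", "MD", "VB", "AT", "NN"], ["AT", "NN", "VBD"]]
print(error_triples(current, gold))
print(best_prev_tag_patch(current, gold))
```

    On this toy data the only error triple is [NN, VB, 2]. Changing NN to VB after AT would break three correct tags (net gain -3), while changing NN to VB after MD fixes both errors and breaks nothing (net gain 2), so that patch is the one selected.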

    Note: there are two simple but important refinements of the initial training having to do with the treatment of unknown words.

    1. Capitalized unknown words are tagged as proper names.
    2. For other unknown words, assign the tag most common for words ending in the same 3 letters:
        blahblahous
      gets tagged as an adjective.
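    The two refinements can be sketched as a small guesser for unknown words. The function names, the toy word list, and the optional fallback_tag are assumptions of this sketch. It also ignores the complication that sentence-initial words are capitalized whether or not they are proper names.

```python
from collections import Counter, defaultdict

SUFFIX_LEN = 3  # "ending in the same 3 letters"


def train_suffix_table(tagged_words):
    """Map each 3-letter word ending to the tag most common among words with that ending."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        if len(word) >= SUFFIX_LEN:
            counts[word[-SUFFIX_LEN:].lower()][tag] += 1
    return {suffix: tags.most_common(1)[0][0] for suffix, tags in counts.items()}


def guess_unknown_tag(word, suffix_table, proper_tag="NP", fallback_tag=None):
    """Guess a tag for a word that was never seen in training.

    proper_tag is the Brown-style proper noun tag; fallback_tag (hypothetical)
    is returned when the ending was never seen either.
    """
    if word[:1].isupper():
        return proper_tag
    return suffix_table.get(word[-SUFFIX_LEN:].lower(), fallback_tag)


# Toy training words (hypothetical, Brown-style tags).
training_words = [("famous", "JJ"), ("nervous", "JJ"), ("walrus", "NN"), ("running", "VBG")]
suffix_table = train_suffix_table(training_words)
print(guess_unknown_tag("blahblahous", suffix_table))
print(guess_unknown_tag("Blahblah", suffix_table))
```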

    Some sample rules found by Brill's algorithm:

    1. TO IN NEXT-TAG AT
    2. VBN VBD PREV-WORD-IS-CAP YES
    3. VBD VBN PREV-1-OR-2-OR-3-TAG HVD
    4. TO IN NEXT-WORD-IS-CAP YES
    5. NN VB PREV-TAG MD
    6. PPS PPO NEXT-TAG .
    7. VBN VBD PREV-TAG PPS
    8. NP NN CURRENT-WORD-IS-CAP NO
    Key:
    TO Infinitival to
    AT Article
    IN Preposition
    VBD Past tense Verb
    VBN Past participle Verb
    NP  Proper Noun
    NN Common Noun
    MD Modal
    PPO Objective (Accusative) Personal Pronoun
    PPS Subject (Nominative) Personal Pronoun
    HVD Had
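    Each sample rule reads as "old tag, new tag, condition, argument". As a reading aid, the small helper below spells a rule out in English. The function name and the wording of the glosses are this sketch's own, and only the condition names that appear in the sample rules are handled.

```python
# Conditions that take a tag as their argument.
TAG_CONDITIONS = {
    "NEXT-TAG": "the next word is tagged {}",
    "PREV-TAG": "the previous word is tagged {}",
    "PREV-1-OR-2-OR-3-TAG": "one of the three preceding words is tagged {}",
}

# Conditions that take YES or NO as their argument.
CAP_CONDITIONS = {
    "PREV-WORD-IS-CAP": "previous",
    "NEXT-WORD-IS-CAP": "next",
    "CURRENT-WORD-IS-CAP": "current",
}


def gloss(rule):
    """Spell out a rule such as 'NN VB PREV-TAG MD' in English."""
    from_tag, to_tag, condition, arg = rule.split()
    if condition in CAP_CONDITIONS:
        verb = "is" if arg == "YES" else "is not"
        when = f"the {CAP_CONDITIONS[condition]} word {verb} capitalized"
    else:
        when = TAG_CONDITIONS[condition].format(arg)
    return f"Change {from_tag} to {to_tag} when {when}"


for rule in ["TO IN NEXT-TAG AT", "NN VB PREV-TAG MD", "NP NN CURRENT-WORD-IS-CAP NO"]:
    print(gloss(rule))
```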
    

    Summary:

    1. Competitive with statistical taggers.
    2. Portable: doesn't depend on any particular tagset or corpus properties.
    3. Simple/low memory overhead.