Intro to Comp Ling

Tagging

    Tagging
    Task
      The Penn Treebank Tagset

      Sentence   CLAWS-C5   Brown   Penn   ICE
      she        PNP        PPS     PRP    PRON(pers,sing)
      was        VBD        BEDZ    VBD    AUX(pass,past)
      told       VVN        VBN     VBN    V(ditr,edp)
      that       CJT        CS      IN     CONJUNC(subord)
      the        AT0        AT      DT     ART(def)
      journey    NN1        NN      NN     N(com,sing)
      might      VM0        MD      MD     AUX(modal,past)
      kill       VVI        VB      VB     V(montr,infin)
      her        PNP        PPO     PRP    PRON(pers,sing)
      .          PUN        .       .      PUNC(per)

    Many distinctions are much finer grained than the standard set of linguistic categories. Some are arguably uselessly so. Example: the Brown tagset and some of its descendants have distinct tags for the different forms of the verbs BE and HAVE (e.g., HVD = had).

    Not as fine-grained as a unification grammar with features.

    Uses of
    Tagging
     

    Well, why would you want to do this?

    Tagged Corpora and corpus sites.

    Question: How was the BNC tagged?

    Difficulty
    of Problem
     

    Ambiguity

    The null-hypothesis tagger: tag each word with its most frequent tag. Gets about 90% right.
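    A minimal sketch of this baseline (the corpus, tag names, and the NN default for unknown words here are illustrative, not from the BNC):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from a tagged corpus.

    tagged_sentences: iterable of [(word, tag), ...] lists.
    Returns a dict mapping word -> most frequent tag.
    """
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, word_tag, default="NN"):
    # Unknown words get a single default tag -- the tag most unknown
    # words receive in held-out data (for English, typically common noun).
    return [(w, word_tag.get(w, default)) for w in words]

# Tiny made-up training corpus (Brown-style tags):
corpus = [
    [("the", "AT"), ("can", "NN"), ("rusted", "VBD")],
    [("she", "PPS"), ("can", "MD"), ("run", "VB")],
    [("the", "AT"), ("can", "NN"), ("fell", "VBD")],
]
model = train_baseline(corpus)
print(tag_baseline(["the", "can", "glorped"], model))
# → [('the', 'AT'), ('can', 'NN'), ('glorped', 'NN')]
```

    Note that "can" is tagged NN even in verbal contexts: the baseline ignores context entirely, which is exactly why it tops out around 90%.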

    One major source of difficulty is tag ambiguity. Many tag ambiguities are quite systematic in ways that are particular to English. For instance, here is some information about tag ambiguities in the BNC, taken from the BNC website (bnc2error.htm#table2):

      Columns (counts are out of 50,000 words):

      (a) Tag
      (b) Single-tag count
      (c) Ambiguity-tag count
      (d) Ambiguity rate (%) = c / (b + c)
      (e) 1st tag of ambiguity tag correct (% of all ambiguity tags)
      (f) Error count
      (g) Error rate (%) = f / b

      Tag   (b)    (c)        (d)       (e)             (f)   (g)
      AJ0   3412   all 338    9.01%     282 (83.43%)    46    1.35%
              (AJ0-AV0 48, AJ0-NN1 209, AJ0-VVD 21, AJ0-VVG 28, AJ0-VVN 32)
      AJC   142               0.0%                      4     2.82%
      AJS   26                0.0%                      2     7.69%
      AT0   4351              0.0%                      2     0.05%
      AV0   2450   all 45     1.80%     37 (82.22%)     57    2.33%
              (AV0-AJ0 45)
      AVP   379    all 44     10.40%    34 (77.27%)     6     1.58%
              (AVP-PRP 44)
      AVQ   157    all 10     5.99%     10 (100.00%)    9     5.73%
              (AVQ-CJS 10)
      CJC   1915              0.0%                      3     0.16%
      CJS   692    all 39     5.34%     30 (76.92%)     18    2.60%
              (CJS-AVQ 26, CJS-PRP 13)
      CJT   236    all 28     10.61%                    3     1.27%
              (CJT-DT0 28)
      CRD   940    all 1      0.11%     0 (0.00%)       0     0.00%
              (CRD-PNI 1)
      DPS   787               0.0%                      0     0.00%
      DT0   1180   all 20     1.67%     16 (80.00%)     19    1.61%
              (DT0-CJT 20)
      DTQ   370               0.0%                      0     0.00%
      EX0   131               0.0%                      1     0.76%
      ITJ   214               0.0%                      2     0.93%
      NN0   270               0.0%                      10    3.70%
      NN1   7198   all 514    6.66%     395 (76.84%)    86    1.19%
              (NN1-AJ0 130, NN1-NP0 92*, NN1-VVB 243, NN1-VVG 49)
      NN2   2718   all 55     1.98%     48 (87.27%)     30    1.10%
              (NN2-VVZ 55)
      NP0   1385   all 264    16.01%    224 (84.84%)    31    2.24%
              (NP0-NN1 264*)
      ORD   136               0.0%                      0     0.00%
      PNI   159    all 8      4.79%     3 (37.50%)      5     3.14%
              (PNI-CRD 8)
      PNP   2646              0.0%                      0     0.00%
      PNQ   112               0.0%                      0     0.00%
      PNX   84                0.0%                      0     0.00%
      POS   217               0.0%                      5     2.30%
      PRF   1615              0.0%                      0     0.00%
      PRP   4051   all 166    3.94%     154 (92.77%)    24    0.59%
              (PRP-AVP 132, PRP-CJS 34)
      TO0   819               0.0%                      6     0.73%
      UNC   158               0.0%                      4     2.53%
      VBB   328               0.0%                      1     0.30%
      VBD   663               0.0%                      0     0.00%
      VBG   37                0.0%                      0     0.00%
      VBI   374               0.0%                      0     0.00%
      VBN   133               0.0%                      0     0.00%
      VBZ   640               0.0%                      4     0.63%
      VDB   87                0.0%                      0     0.00%
      VDD   71                0.0%                      0     0.00%
      VDG   10                0.0%                      0     0.00%
      VDI   36                0.0%                      0     0.00%
      VDN   20                0.0%                      0     0.00%
      VDZ   22                0.0%                      0     0.00%
      VHB   150               0.0%                      1     0.67%
      VHD   258               0.0%                      0     0.00%
      VHG   16                0.0%                      0     0.00%
      VHI   119               0.0%                      0     0.00%
      VHN   9                 0.0%                      0     0.00%
      VHZ   116               0.0%                      1     0.86%
      VM0   782               0.0%                      3     0.38%
      VVB   560    all 84     13.04%    56 (66.67%)     84    15.00%
              (VVB-NN1 84)
      VVD   970    all 90     8.49%     62 (58.89%)     50    5.15%
              (VVD-AJ0 11, VVD-VVN 79*)
      VVG   597    all 132    18.11%    112 (84.84%)    9     1.51%
              (VVG-AJ0 83, VVG-NN1 49)
      VVI   1211              0.0%                      7     0.58%
      VVN   1086   all 158    12.70%    113 (71.52%)    27    2.49%
              (VVN-AJ0 50, VVN-VVD 108*)
      VVZ   295    all 26     8.10%     14 (53.85%)     11    3.73%
              (VVZ-NN2 26)
      XX0   363               0.0%                      0     0.00%
      ZZ0   75                0.0%                      3     4.00%

    Another major source of difficulty, and a major contributor to the error rate of all known taggers, is unknown words. What tag do you give an unknown word?

    Default strategy: Look at what tag most unknown words get in some development test data. Use that tag for all unknown words.

    What tag is that? Guess....

    A possible augmentation for any tagger: A dictionary. [Reduces but does not eliminate unknown words.]

    Tagger              Error rate   Note
    Church HMM tagger   1-5%         depending on definition of "correct"
    Garside et al.      3-4%         probabilistic, plus idiom rules
    DeRose              3-4%, 5.6%   WSJ, other
    Brill (initial)     7.9%         "simple" algorithm
    Brill               5%           71 "patches" (rules)

    Brill's tagger

    Approaches  

    Two standard approaches:

    • Rule-based
    • Statistically based

    Within rule-based we can distinguish two other types:

    1. Handwritten
    2. Machine-learned
    Tagset
    Variation
     

    Some tagsets are harder than others.

      Tag Set          Basic size   Total tags
      Brown            87           179
      Penn             45
      CLAWS1           132
      CLAWS2           166
      CLAWS C5 (BNC)   62
      London-Lund      197
    Brill says: "There are 192 tags in the Brown corpus, 96 of which occur more than 100 times."

    BNC Tagset ("Claws", C5)

    One potential task is to define a tagset that maximizes utility for parsing.

    Brill's
    tagger
     

    Properties

    • Rule based
    • Rules are automatically learned.

    A rule space.

    Possible rule forms:

    1. If a word is tagged a and it is in context C, then change that tag to b.
    2. If a word is tagged a and it has lexical property P, then change that tag to b.
    3. If a word is tagged a and a word in region R has lexical property P, then change that tag to b.

    Possible patch templates (rule templates):

      Change tag a to tag b when:
      1. The preceding (following) word is tagged z.
      2. The word 2 after (before) is tagged z.
      3. One of the two following (preceding) words is tagged z.
      4. One of the three following (preceding) words is tagged z.
      5. The preceding word is tagged z and the following word is tagged w.
      6. The preceding (following) word is tagged z and the word 2 before (after) is tagged w.
      7. The current word is (is not) capitalized.
      8. The previous word is (is not) capitalized.

    Training Algorithm (supervised):

    1. Initial training: for each word, learn its most frequent tag. [Training corpus, 90%] [Note: This is the first place where lexically specific info comes into play.]
    2. Patch acquisition [Development test corpus, 5%]:
      1. Collect a list of error triples of the form [Taga, Tagb, Number].
      2. For each error triple and each patch template, find the instantiated patch that gives the best net error gain, where
          net error gain = errors removed - errors added
        and add that patch to the patch list.
    3. Runtime [Test corpus, 5%]:
      1. Tag the test corpus using the initial-training tagger.
      2. Apply each patch in order, changing tag a to tag b wherever the patch's context matches, but only for words that occur with tag b somewhere in the training corpus. [Note: This is the second place where lexically specific info comes into play.]
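    The patch-acquisition loop above can be sketched as greedy error-driven search. This toy version restricts itself to a single template (PREV-TAG) and works on one tag sequence; the tags, data, and function names are illustrative, not Brill's actual code:

```python
def apply_patch(tags, patch):
    """Apply one patch: change tag a to b when the previous tag is z."""
    a, b, z = patch
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == a and out[i - 1] == z:
            out[i] = b
    return out

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn_patches(tags, gold, tagset, max_patches=5):
    """Greedily pick the (a, b, z) patch with the best net error gain
    (errors removed minus errors added), apply it, and repeat."""
    patches = []
    for _ in range(max_patches):
        best, best_gain = None, 0
        base = errors(tags, gold)
        # Candidate (a -> b) pairs come from currently observed errors.
        cands = {(t, g) for t, g in zip(tags, gold) if t != g}
        for a, b in cands:
            for z in tagset:
                gain = base - errors(apply_patch(tags, (a, b, z)), gold)
                if gain > best_gain:
                    best, best_gain = (a, b, z), gain
        if best is None:          # no patch yields a net improvement
            break
        tags = apply_patch(tags, best)
        patches.append(best)
    return patches, tags

# "can" mis-tagged NN after a modal; the learner recovers the
# classic "NN -> VB after MD" patch:
gold = ["MD", "VB", "AT", "NN"]
cur  = ["MD", "NN", "AT", "NN"]
print(learn_patches(cur, gold, {"MD", "VB", "AT", "NN"}))
# → ([('NN', 'VB', 'MD')], ['MD', 'VB', 'AT', 'NN'])
```

    The full algorithm does the same thing over all eight templates and an entire development corpus, which is why it finds patches like rule 5 in the sample list below.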

    Note: there are two simple but important refinements of the initial training having to do with the treatment of unknown words.

    1. Capitalized unknown words are tagged as proper names.
    2. For other unknown words, assign the tag most common for words ending in the same 3 letters:
        blahblahous
      gets tagged as an adjective.
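    A sketch of both unknown-word refinements (training pairs and tag names here are made up for illustration):

```python
from collections import Counter, defaultdict

def train_suffix_tags(tagged_words, n=3):
    """Map each word-final n-letter suffix to its most common tag."""
    suffix_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        if len(word) >= n:
            suffix_counts[word[-n:]][tag] += 1
    return {s: c.most_common(1)[0][0] for s, c in suffix_counts.items()}

def tag_unknown(word, suffix_tags, n=3, default="NN"):
    # Refinement 1: capitalized unknown words are tagged as proper nouns.
    if word[:1].isupper():
        return "NP"
    # Refinement 2: otherwise use the tag most common for words
    # ending in the same n letters, falling back to a default.
    return suffix_tags.get(word[-n:], default)

# Hypothetical training pairs (Brown-style tags):
pairs = [("famous", "JJ"), ("nervous", "JJ"), ("running", "VBG")]
suffixes = train_suffix_tags(pairs)
print(tag_unknown("blahblahous", suffixes))   # → JJ, via the "ous" suffix
print(tag_unknown("Gwendolyn", suffixes))     # → NP, capitalized
```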

    Some sample rules found by Brill's algorithm:

    1. TO IN NEXT-TAG AT
    2. VBN VBD PREV-WORD-IS-CAP YES
    3. VBD VBN PREV-1-OR-2-OR-3-TAG HVD
    4. TO IN NEXT-WORD-IS-CAP YES
    5. NN VB PREV-TAG MD
    6. PPS PPO NEXT-TAG .
    7. VBN VBD PREV-TAG PPS
    8. NP NN CURRENT-WORD-IS-CAP NO
    Key:
    TO Infinitival to
    AT Article
    IN Preposition
    VBD past tense Verb
    VBN past participle Verb
    NP  Proper Noun
    NN Common Noun
    MD Modal
    PPO Objective (Accusative) Personal Pronoun
    PPS Subject (Nominative) Personal Pronoun
    HVD Had
    

    Summary:

    1. Competitive with statistical taggers.
    2. Portable: doesn't depend on any particular tagset or corpus properties.
    3. Simple, with low memory overhead.