Computational Linguistics

Linguistics 581

Morphology and Finite-State Transducers

Morphological analysis: Finding morphological constituents

angrily = angry + ly
proven = prove + en
* haven = have + en
ducks = duck + pl
ducks = duck + 3rdsg
ground = ground (N, sg)
ground = grind + pst
undeniability = un + deny + able + ity

Part of speech: a class of words that share many properties (more later). Examples: nouns, verbs.

Inflection vs. derivation

duck vs ducks, cat vs. cats, ox vs. oxen: All nouns have plural forms (almost?: equipment, apparatus, furniture, infantry)
walk vs walking, talk vs talking, smoke vs smoking: All verbs have -ing forms (gerund, present participle)
Spanish verb amar ("to love")

Inflection: a morphological alternation common to all members of a part of speech:

Sound system versus spelling

Allomorphy: plural s

fox: foxes, /f aa k s/ + /ax z/
dog: dogs, /d ao g/ + /z/
duck: ducks, /d ah k/ + /s/
lily: lilies, /l ih l iy/ + /z/

In this chapter we deal with spelling. This means we are concerned with spelling rules instead of phonological rules for allomorphy. We operate on orthographic representations not phonetic representations.

Spelling rules: plural
Orthographic Singular	Phonology	Orthographic Plural
teepee	/t iy p iy/ + /z/	teepees
lily	/l ih l iy/ + /z/	lilies

Productive ending: [s] (the morpheme s, with its phonologically predictable allomorphs) versus irregular forms. Notice that many of the irregular forms are not formed by affixation.

Regular	Irregular
ducks = duck + PL	oxen = ox + PL
lilies = lily + PL	children = child + PL
foxes = fox + PL	deer = deer + PL
hogs = hog + PL	mice = mouse + PL
houses = house + PL	geese = goose + PL
cups = cup + PL	men = man + PL
bellies = belly + PL	cacti = cactus + PL

Morphological analysis of a word:

morphological features

For example, we say that all plural forms share the morphological feature +PL. The plural forms deer, men, mice, and geese, which are not realized by affixation, share the morphological feature +PL with forms like foxes and ducks, which are. The forms deer, man, mouse, goose, fox, and duck all share the morphological feature +SG.

We assume the category is a morphological feature.

Parsing versus Recognition

Morphological recognition: Accepts and rejects forms:

Morphological parsing produces a morphological analysis (stem first, followed by category of stem, followed by all affixes):
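The difference between the two tasks can be sketched with a toy lookup table. The table, the tag spellings (+N, +V, +3SG, +PST, +EN), and the function names are assumptions made for illustration; a real system would compute the analyses rather than list them.

```python
# Toy analysis table (hypothetical): surface form -> analyses,
# each written stem first, then category, then affix features.
ANALYSES = {
    "ducks": ["duck +N +PL", "duck +V +3SG"],
    "ground": ["ground +N +SG", "grind +V +PST"],
    "proven": ["prove +V +EN"],
}

def recognize(form):
    """Recognition: accept or reject a surface form."""
    return form in ANALYSES

def parse(form):
    """Parsing: return every analysis of a surface form (possibly none)."""
    return ANALYSES.get(form, [])

print(recognize("proven"))  # accepted: prove + en
print(recognize("haven"))   # rejected in this toy table: no analysis as have + en
print(parse("ducks"))       # two analyses: the form is ambiguous
```

A recognizer answers only yes or no; a parser returns the analyses, so an ambiguous form such as ducks or ground yields more than one result.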

Relating surface and "underlying" forms

Morphotactic recognition

Morphotactics is the syntax of morphemes: what order they come in, what kind of units they make.

A basic morphotactic fact about affixes is where they attach with respect to the stem.

Plural -s is a suffix, un- is a prefix. There are also morphotactic facts about what kinds of things affixes attach to:

The affix -able attaches to a verb and produces an adjective. The affix -ity attaches to an adjective and produces a noun.
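These attachment facts can be sketched as a small table that is checked while affixes are added one at a time. The entries for -able and -ity come from the text; the entry for un- (attaches to an adjective, yields an adjective) and the names AFFIXES and derive are assumptions for illustration.

```python
# affix -> (position, category it attaches to, category it produces)
AFFIXES = {
    "able": ("suffix", "V", "A"),  # from the text: verb -> adjective
    "ity": ("suffix", "A", "N"),   # from the text: adjective -> noun
    "un": ("prefix", "A", "A"),    # assumption: un- attaches to adjectives
}

def derive(stem, stem_cat, affixes):
    """Attach affixes in the given order.

    Returns (morpheme string, resulting category), or None as soon as an
    affix is attached to a category it does not select.
    """
    form, cat = stem, stem_cat
    for affix in affixes:
        position, attaches_to, produces = AFFIXES[affix]
        if cat != attaches_to:
            return None
        form = f"{affix}+{form}" if position == "prefix" else f"{form}+{affix}"
        cat = produces
    return form, cat

print(derive("deny", "V", ["able", "un", "ity"]))  # well-formed: ends as a noun
print(derive("deny", "V", ["ity"]))                # ill-formed: -ity needs an adjective
```

The order of attachment matters: deny must first become an adjective with -able before un- or -ity can apply.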

Using FSAs to do recognition

Word-class automaton: The morphotactics
Lexicon assumed for recognition
1. All stems, with word-class info attached
  - fox, wine: reg-noun
  - geese, fish, pants: irreg-pl-noun
  - goose, fish, equipment, kindling: irreg-sg-noun
2. All affixes
Word automaton: A recognizer
Recognizes "morphotactic strings", not surface strings of English ("foxs", not "foxes"). One kind of information is still missing: how a morpheme sequence is actually spelled.
We need realizational rules that tell us: fox + s => foxes.
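A minimal sketch of such a recognizer, working on morpheme sequences rather than surface spellings. The lexicon entries are the ones listed above; the arc layout (only regular nouns reach the state from which -s can attach) is an assumption, since the automaton figure is not reproduced here.

```python
# Stems with word-class info, plus the one affix.
LEXICON = {
    "fox": {"reg-noun"},
    "wine": {"reg-noun"},
    "geese": {"irreg-pl-noun"},
    "pants": {"irreg-pl-noun"},
    "fish": {"irreg-pl-noun", "irreg-sg-noun"},
    "goose": {"irreg-sg-noun"},
    "equipment": {"irreg-sg-noun"},
    "kindling": {"irreg-sg-noun"},
    "s": {"plural"},
}

# Assumed arcs: (state, word class) -> next state.
TRANSITIONS = {
    (0, "reg-noun"): 1,
    (0, "irreg-sg-noun"): 2,
    (0, "irreg-pl-noun"): 2,
    (1, "plural"): 2,
}
FINAL = {1, 2}

def recognize(morphemes):
    """Accept or reject a morpheme sequence such as ["fox", "s"]."""
    states = {0}  # a set, because a stem like "fish" has two word classes
    for morpheme in morphemes:
        states = {
            TRANSITIONS[(q, cls)]
            for q in states
            for cls in LEXICON.get(morpheme, ())
            if (q, cls) in TRANSITIONS
        }
    return bool(states & FINAL)

print(recognize(["fox", "s"]))    # accepted, although the surface spelling is foxes
print(recognize(["geese", "s"]))  # rejected: no plural arc after an irregular noun
```

Note that the machine accepts fox + s without knowing that the result is spelled foxes; that is exactly the gap the spelling rules fill.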

Finite-State transducers

We introduce finite-state transducers, an augmentation of FSAs in which there are two tapes.

A two-tape automaton, the upper tape an "underlying representation" tape and the lower a "surface representation"
The main idea: The FST is just an FSA on which the arcs are labeled by pairs of symbols, an underlying symbol and a surface symbol. An arc a can be taken just in case the current symbol on the top tape matches the underlying representation symbol on a and the current symbol on the bottom tape matches the surface representation symbol on a.
- A transition labeled a:e means "a" as underlying representation corresponds to "e" as surface representation
- A transition "a" (with no colon) is an abbreviation for "a:a", meaning "a" as underlying representation corresponds to "a" as surface representation.
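The pair-labeled arcs and the no-colon abbreviation can be sketched as follows. The machine is a toy with five arcs that relates underlying goose to surface geese; the arc set and the names parse_label, ARCS, and accepts are assumptions for illustration, and the sketch has no epsilon arcs, so both tapes must have the same length.

```python
def parse_label(label):
    """'o:e' -> ('o', 'e'); a bare 'g' abbreviates 'g:g'."""
    upper, _, lower = label.partition(":")
    return (upper, lower or upper)

# (state, arc label) -> next state; a linear toy machine.
ARCS = {(0, "g"): 1, (1, "o:e"): 2, (2, "o:e"): 3, (3, "s"): 4, (4, "e"): 5}
FINAL = {5}

# Expand the labels into (underlying, surface) symbol pairs.
DELTA = {(q, parse_label(label)): nxt for (q, label), nxt in ARCS.items()}

def accepts(upper, lower):
    """True if the machine relates the underlying tape to the surface tape."""
    if len(upper) != len(lower):
        return False  # no epsilon arcs in this toy machine
    state = 0
    for pair in zip(upper, lower):
        # An arc is taken only if BOTH tapes match its pair of symbols.
        state = DELTA.get((state, pair))
        if state is None:
            return False
    return state in FINAL

print(accepts("goose", "geese"))
print(accepts("goose", "goose"))
```

The same table can be read in either direction, which is why one machine serves for both parsing and generation.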
Finite-state transducers (FSTs) give us the technology to do parsing.
Finite-state transducers also give us the technology to generate surface forms.
We will relate surface to underlying forms via an "intermediate" morphotactic representation. This will take two separate FSTs, one relating surface to intermediate representations, one relating intermediate to underlying representations.
Relating underlying to intermediate (morphotactic) representation:
1. The morphological lexicon is just a large FST relating stems to morphological word classes
  1. Stems related to word-class info
    - fox, wine: reg-noun
    - geese, fish, pants: irreg-pl-noun
    - fish, equipment, kindling: irreg-sg-noun
  2. The complication is irregular plurals related to an unpredictable singular:
2. The FST that captures this information
3. We also need to relate word-class information to morphological features and morphotactic information: the word-class transducer
4. The "composition" of these two transducers
5. Linguistically the picture is nice:
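The steps above can be sketched by treating each transducer as a small finite relation and chaining the two. This is a simplification, not a real FST composition: the tuples, the feature strings (+N +SG, +N +PL), the intermediate endings (# and ^s#), and the function names are assumptions for illustration.

```python
# Lexicon relation: (underlying stem, intermediate stem, word class).
LEXICON = [
    ("fox", "fox", "reg-noun"),
    ("wine", "wine", "reg-noun"),
    ("goose", "goose", "irreg-sg-noun"),
    ("goose", "geese", "irreg-pl-noun"),  # plural with an unpredictable singular
    ("fish", "fish", "irreg-sg-noun"),
    ("fish", "fish", "irreg-pl-noun"),
]

# Word-class relation: (word class, features) -> intermediate ending.
# "^" is a morpheme boundary and "#" a word boundary.
WORD_CLASS = {
    ("reg-noun", "+N +SG"): "#",
    ("reg-noun", "+N +PL"): "^s#",
    ("irreg-sg-noun", "+N +SG"): "#",
    ("irreg-pl-noun", "+N +PL"): "#",
}

def generate(stem, features):
    """Underlying -> intermediate: chain the lexicon and word-class relations."""
    return sorted({
        inter + WORD_CLASS[(cls, features)]
        for under, inter, cls in LEXICON
        if under == stem and (cls, features) in WORD_CLASS
    })

def analyze(intermediate):
    """Intermediate -> underlying: run the same two relations backwards."""
    return sorted(
        f"{under} {feats}"
        for under, inter, cls in LEXICON
        for (c, feats), ending in WORD_CLASS.items()
        if c == cls and inter + ending == intermediate
    )

print(generate("fox", "+N +PL"))
print(generate("goose", "+N +PL"))
print(analyze("fish#"))
```

Regular plurals come out with the boundary and suffix still visible (fox^s#), which is the input the spelling rules expect; irregular plurals are already fully spelled.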

Relating intermediate (morphotactic) representation to surface (spelling rules).

The problem

beg + ing = begging: Consonant doubling
make + ing = making: e-deletion
watch + s = watches: e-insertion

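E-insertion: insert an e after a morpheme ending in x, s, or z and before a word-final s. (As stated, the rule is simplified: it leaves out ch and sh, so it does not yet cover watch + s = watches.)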

Chomsky/Halle style "rewrite" rules: eps -> e / {x, s, z} ^ __ s #

This can be modeled with an FST.

Interpreting the rule:

^ means a morpheme boundary.
# means a word boundary.
We assume word boundaries are "found" on the surface, but morpheme boundaries are not (written language: English). Thus every # transition is default (#:#), and every ^ transition is ^:eps.
Other means, basically, not {s,x,z}, not ^, not #.
This machine is designed to leave every word that the rule does NOT apply to unchanged.
State 0 is the irrelevant input state. We stay in state 0 until we see something relevant. State 0 is a final state.
State 1 means we found an {s,x,z}. State 2 means we've gone on to find a morpheme boundary.
State 2 is the rule-ready state. We've seen a morpheme end in {s,x,z}.
The FST incorporates ONLY the information in our rewrite rule (as written). It doesn't "know" anything else about English. Some strange cases:
1. The ^:eps transition from state 5 to state 2.
2. The {z,s,x} transition from 5 to 1. This is exercise 3.10.
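The machine described above can be sketched as a transition table and run in the generation direction (intermediate to surface). The table is a reconstruction from the state descriptions in these notes and the usual textbook presentation of this rule, so treat the exact arcs as an assumption; the names TABLE, symbol_class, and generate are hypothetical.

```python
FINAL = {0, 1, 2}

# state -> {arc label: next state}. Labels: "s", "zx" (z or x), "^" (^:eps),
# "#", "other", and "eps:e" (insert e without reading any input).
TABLE = {
    0: {"s": 1, "zx": 1, "^": 0, "#": 0, "other": 0},
    1: {"s": 1, "zx": 1, "^": 2, "#": 0, "other": 0},
    2: {"s": 5, "zx": 1, "^": 0, "#": 0, "other": 0, "eps:e": 3},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "zx": 1, "^": 2, "other": 0},  # no "#" arc: ^s# needs the e
}

def symbol_class(ch):
    """Map an input character onto an arc label."""
    if ch == "s":
        return "s"
    if ch in ("z", "x"):
        return "zx"
    if ch in ("^", "#"):
        return ch
    return "other"

def generate(intermediate):
    """Return every surface string the machine pairs with `intermediate`."""
    results = set()
    stack = [(0, 0, "")]  # (state, position in input, surface so far)
    while stack:
        state, i, out = stack.pop()
        arcs = TABLE[state]
        if i == len(intermediate):
            if state in FINAL:
                results.add(out)
            continue
        if "eps:e" in arcs:  # try inserting an e
            stack.append((arcs["eps:e"], i, out + "e"))
        ch = intermediate[i]
        label = symbol_class(ch)
        if label in arcs:  # "^" is read but writes nothing on the surface
            stack.append((arcs[label], i + 1, out + ("" if ch == "^" else ch)))
    return sorted(results)

print(generate("fox^s#"))
print(generate("cat^s#"))
print(generate("watch^s#"))
```

With this table, fox^s# comes out as foxes# and cat^s# as cats#. The form watch^s# comes out as watchs#, because the machine knows only the rule as written, which mentions x, s, and z.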

Using the E-insertion rule when parsing:

The input: only the surface tape is known

Underlying
State
Surface	a	s	s	e	s	s	#

The correct parse

Underlying	a	s	s	e	s	s	#
State	0	1	1	0	1	1	0
Surface	a	s	s	e	s	s	#

A false path

Underlying	a	s	s	^	eps	s
State	0	1	1	2	3	4	fail
Surface	a	s	s	eps	e	s	s	#
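The two traces can be reproduced with a short sketch that steps through aligned underlying:surface pairs and records the state reached after each pair. The transition table is a reconstruction from the state descriptions in these notes and the usual textbook presentation of the E-insertion machine, so the exact arcs are an assumption, as are the names TABLE, pair_label, and trace.

```python
EPS = "eps"

# state -> {arc label: next state}. Labels: "s", "zx" (z or x), "^" (^:eps),
# "#", "other", and "eps:e" (an e on the surface with nothing underlying).
TABLE = {
    0: {"s": 1, "zx": 1, "^": 0, "#": 0, "other": 0},
    1: {"s": 1, "zx": 1, "^": 2, "#": 0, "other": 0},
    2: {"s": 5, "zx": 1, "^": 0, "#": 0, "other": 0, "eps:e": 3},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "zx": 1, "^": 2, "other": 0},
}

def pair_label(upper, lower):
    """Map an underlying:surface pair onto an arc label, or None if no arc fits."""
    if (upper, lower) == ("^", EPS):
        return "^"
    if (upper, lower) == (EPS, "e"):
        return "eps:e"
    if upper != lower or upper in (EPS, "^"):
        return None  # every other arc copies its symbol unchanged
    if upper == "s":
        return "s"
    if upper in ("z", "x"):
        return "zx"
    if upper == "#":
        return "#"
    return "other"

def trace(pairs):
    """Return the state after each pair, ending in "fail" if the path dies."""
    state, states = 0, []
    for upper, lower in pairs:
        state = TABLE[state].get(pair_label(upper, lower))
        if state is None:
            states.append("fail")
            break
        states.append(state)
    return states

correct = [("a", "a"), ("s", "s"), ("s", "s"), ("e", "e"),
           ("s", "s"), ("s", "s"), ("#", "#")]
false_path = [("a", "a"), ("s", "s"), ("s", "s"), ("^", EPS),
              (EPS, "e"), ("s", "s"), ("s", "s")]

print(trace(correct))
print(trace(false_path))
```

The false path guesses that the e of assess was inserted by the rule. It dies in state 4, which allows only a word boundary next, while the surface tape still has an s to read.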