Linguistics 581
Morphology and Finite-State Transducers
Morphological analysis: Finding morphological constituents
- angrily = angry + ly
- proven = prove + en
- * haven = have + en
- ducks = duck + pl
- ducks = duck + 3rdsg
- ground = ground (N, sg)
- ground = grind + pst
- undeniability = un + deny + able + ity
Part of speech: a class of words that share many properties
(more later). Examples: nouns, verbs.
Inflection vs. derivation
- duck vs ducks, cat vs. cats, ox vs. oxen:
All nouns have plural forms (almost?: equipment, apparatus, furniture,
infantry)
- walk vs walking, talk vs talking, smoke vs smoking: All verbs have
-ing forms (gerund, present participle)
- Spanish verb amar ("to love")
Inflection: a morphological alternation common to all members of
a part of speech:
walking = walk + ing
Form = Stem + Suffix
Sound system versus spelling
- English consonants
- English vowels
Allomorphy: plural s
- fox: foxes, /f aa k s/ + /ax z/
- dog: dogs, /d ao g/ + /z/
- duck: ducks, /d uh k/ + /s/
- lily: lilies, /l ih l iy/ + /z/
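The choice among the three allomorphs above depends only on the last sound of the stem. A minimal sketch, assuming ARPAbet-style phone symbols as in the examples; the function name and the two phone sets are illustrative assumptions, not part of the course materials:

```python
# Assumed phone classes (ARPAbet-style symbols).
SIBILANTS = {"s", "z", "sh", "zh", "ch", "jh"}
VOICELESS = {"p", "t", "k", "f", "th"}

def plural_allomorph(phones):
    """Return the plural allomorph (as a list of phones) for a stem."""
    last = phones[-1]
    if last in SIBILANTS:
        return ["ax", "z"]   # fox: /f aa k s/ + /ax z/
    if last in VOICELESS:
        return ["s"]         # duck: /d uh k/ + /s/
    return ["z"]             # dog: /d ao g/ + /z/

for stem in ["f aa k s", "d ao g", "d uh k"]:
    print(stem, "+", " ".join(plural_allomorph(stem.split())))
```

The rule is stated over sounds, which is why it cannot be applied directly to spellings.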
In this chapter we deal with spelling. This means we
are concerned with spelling rules instead of
phonological rules for allomorphy. We operate on orthographic
representations, not phonetic representations.
Spelling rules: plural
Orthographic Singular | Phonology | Orthographic Plural |
teepee | /t iy p iy/ + /z/ | teepees |
lily | /l ih l iy/ + /z/ | lilies |
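The same phonological plural /z/ surfaces with different spellings. A rough sketch of regular plural spelling in Python; the y-to-ie condition (consonant before final y) and the function name are assumptions made for illustration, and irregular nouns are not handled:

```python
VOWELS = set("aeiou")

def plural_spelling(noun):
    """Spell the regular plural of a noun (irregulars not covered)."""
    # Assumed rule: consonant + y becomes ies (belly -> bellies).
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # e-insertion after s, x, z (fox -> foxes).
    if noun.endswith(("s", "x", "z")):
        return noun + "es"
    # Default: just add s (teepee -> teepees).
    return noun + "s"

for noun in ["teepee", "belly", "fox", "duck"]:
    print(noun, "->", plural_spelling(noun))
```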
Productive ending: [s] (the morpheme s, with
its phonologically predictable allomorphs) versus irregular forms.
Notice that many of the irregular forms are not formed by affixation.
Regular | Irregular |
ducks = duck + PL | oxen = ox + PL |
lilies = lily + PL | children = child + PL |
foxes = fox + PL | deer = deer + PL |
hogs = hog + PL | mice = mouse + PL |
houses = house + PL | geese = goose + PL |
cups = cup + PL | men = man + PL |
bellies = belly + PL | cacti = cactus + PL |
Morphological analysis of a word:
The stem plus the various morphological features of the word,
whether or not they are signaled by affixation.
For example, we say that all plural forms share the
morphological feature +PL. The plural forms deer, men, mice, and
geese, which are not realized by affixation, share the feature +PL
with forms like foxes and ducks, which are. The forms
deer, man, mouse, goose, fox, and duck all share the
morphological feature +SG.
We assume the category is a morphological feature.
Parsing versus Recognition
Morphological recognition: Accepts and rejects forms:
Accept: geese
Reject: gooses
Morphological parsing produces a morphological analysis (stem first,
followed by category of stem, followed by all affixes):
geese: goose + N + PL
goose: goose + N + SG, goose + V
ground: ground +N +SG, grind +V +PPart
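The difference between the two tasks is only in what is returned. A toy sketch, assuming a hand-written table of analyses (the dictionary and function names are hypothetical; a real system would compute the analyses rather than list them):

```python
# Hypothetical analysis table, using two forms from the examples above.
ANALYSES = {
    "geese": ["goose +N +PL"],
    "ground": ["ground +N +SG", "grind +V +PPart"],
}

def recognize(form):
    """Recognition: accept or reject a surface form."""
    return form in ANALYSES

def parse(form):
    """Parsing: return every analysis of a surface form."""
    return ANALYSES.get(form, [])

print(recognize("geese"), recognize("gooses"))
print(parse("ground"))
```

An ambiguous form such as ground gets more than one analysis; a rejected form gets none.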
Morphotactic recognition
Morphotactics is the syntax of morphemes: what order they
come in, what kind of units they make.
A basic morphotactic
fact about affixes is where they attach with
respect to the stem.
Prefix* + Stem + Suffix*
An affix is either a prefix or a suffix (English)
Plural -s is a suffix, un- is a prefix. There are also morphotactic facts
about what kinds of
things affixes attach to:
doability = do + able + ity
*doityable = do + ity + able
The affix -able attaches to a verb and produces an adjective. The affix
-ity attaches to an adjective and produces a noun.
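These attachment facts can be checked mechanically by treating each suffix as a map from an input category to an output category. A small sketch; the table layout and the function name are assumptions for illustration:

```python
# Each suffix: (category it attaches to, category it produces).
AFFIXES = {"able": ("V", "A"), "ity": ("A", "N")}
STEMS = {"do": "V", "deny": "V"}

def derive(stem, suffixes):
    """Return the category of stem + suffixes, or None if ill-formed."""
    category = STEMS[stem]
    for suffix in suffixes:
        needs, gives = AFFIXES[suffix]
        if category != needs:
            return None          # suffix cannot attach here
        category = gives
    return category

print(derive("do", ["able", "ity"]))   # doability
print(derive("do", ["ity", "able"]))   # *doityable
```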
Using FSAs to do recognition
- Word-class automaton: The morphotactics
- Lexicon assumed for recognition
- All stems, with word-class info attached
- fox, wine: reg-noun
- geese, fish, pants: irreg-pl-noun
- goose, fish, equipment, kindling: irreg-sg-noun
- All affixes
- Word automaton: A recognizer
- Recognizes "morphotactic strings", not surface strings of
English (it accepts "foxs", not "foxes"). One kind of
information is missing.
- We need realizational rules that tell us: fox + s => foxes.
On phonological representations these would be allomorphy
rules. On graphic representations they're spelling rules.
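A sketch of the recognizer just described, assuming the small lexicon listed above. It is not a literal state-by-state FSA, but it accepts the same strings as the word-class automaton with stems plugged in; the suffix table is an assumption covering only plural s:

```python
LEXICON = {
    "reg-noun": {"fox", "wine"},
    "irreg-pl-noun": {"geese", "fish", "pants"},
    "irreg-sg-noun": {"goose", "fish", "equipment", "kindling"},
}
# What may follow a stem of each class (morphotactics only).
SUFFIXES = {
    "reg-noun": {"", "s"},
    "irreg-pl-noun": {""},
    "irreg-sg-noun": {""},
}

def recognize(word):
    """Accept word if it is stem + allowed suffix for some class."""
    for word_class, stems in LEXICON.items():
        for stem in stems:
            if word.startswith(stem) and word[len(stem):] in SUFFIXES[word_class]:
                return True
    return False

for word in ["foxs", "foxes", "geese", "gooses"]:
    print(word, recognize(word))
```

Note that it accepts foxs and rejects foxes: it knows morphotactics but no spelling rules.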
Finite-State transducers
We introduce finite-state transducers (FSTs),
an augmentation of FSAs in which there are two tapes.
- A two-tape automaton: the upper tape is an "underlying
representation" tape and the lower tape is a "surface
representation" tape.
The main idea: an FST is just an FSA on which the arcs are
labeled by pairs of symbols, an underlying symbol
and a surface symbol. An arc can be taken
just in case the current symbol on the top
tape matches the arc's underlying symbol
and the current symbol on the bottom tape
matches the arc's surface symbol.
- A transition labeled a:e
means "a" as underlying representation corresponds to
"e" as surface representation
- A transition "a" (with no colon) is an abbreviation
for "a:a", meaning "a" as underlying representation
corresponds to "a" as surface representation.
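A minimal sketch of this idea in Python: a linear FST built from arc labels, where a bare symbol abbreviates an identity pair. The helper names are hypothetical, and the sketch deliberately leaves out epsilon arcs, so both tapes must have the same length:

```python
def expand(label):
    """'a:e' -> ('a', 'e'); a bare 'a' abbreviates 'a:a'."""
    upper, sep, lower = label.partition(":")
    return (upper, lower if sep else upper)

def make_chain(labels):
    """Build a linear FST: state i --labels[i]--> state i + 1."""
    arcs = {(i, expand(label)): i + 1 for i, label in enumerate(labels)}
    return arcs, len(labels)      # (arc table, final state)

def accepts(arcs, final, upper, lower):
    """True if the two tapes are related by the FST."""
    if len(upper) != len(lower):
        return False
    state = 0
    for pair in zip(upper, lower):
        state = arcs.get((state, pair))
        if state is None:         # no arc matches both tapes
            return False
    return state == final

arcs, final = make_chain("m a:e n".split())
print(accepts(arcs, final, "man", "men"))
print(accepts(arcs, final, "man", "man"))
```

The machine m a:e n relates underlying man to surface men, and nothing else.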
Finite-state transducers (FSTs) give us the technology to do
parsing.
We imagine we start with
the surface representation tape containing
a surface word as input. Our job is to fill in an
underlying representation licensed by the FST.
Or, in ambiguous cases, to fill in ALL underlying
representations licensed by the FST.
Finite-state transducers give us the technology to do
generation of surface forms.
We imagine we start with
the underlying representation tape containing
an underlying representation as input. This
may consist of a sequence of affixes and stems
and morphological features taken from the lexicon:
Our job is to find a
corresponding surface representation licensed by the FST,
if there is one. For this example there
isn't, but for
there is.
We will relate surface to underlying forms via
an "intermediate" morphotactic representation.
This will take two separate FSTs, one relating surface
to intermediate representations, one relating
intermediate to underlying representations.
Relating underlying to intermediate (morphotactic) representation:
- The morphological lexicon is just a large
FST relating stems to morphological word classes
- Stems related to word-class info
- fox, wine; reg-noun
- geese, fish, pants; irreg-pl-noun
- fish, equipment, kindling; irreg-sg-noun
- The complication is irregular plurals related to
an unpredictable singular:
g o:e o:e s e irreg-pl-noun
m o:i u:eps s:c e irreg-pl-noun
- The FST that captures this information
- We also need to relate word-class information
to morphological features and morphotactic information:
the word-class transducer
- The "composition" of these two transducers
- Linguistically the picture is nice:
The underlying-to-intermediate transducer is itself composed out
of two transducers, a large one that captures morphological word classes
for stems, and a small one that captures the morphotactics
of the word classes.
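The composition can be sketched by enumerating the relation directly instead of building real transducers. Everything here is an illustrative assumption: the tuple layout, the feature strings (+N +SG, +N +PL), and the intermediate endings written with ^ for a morpheme boundary and # for a word boundary:

```python
# Large transducer: underlying stem, intermediate spelling, word class.
STEMS = [
    ("fox", "fox", "reg-noun"),
    ("wine", "wine", "reg-noun"),
    ("goose", "geese", "irreg-pl-noun"),   # g o:e o:e s e
    ("mouse", "mice", "irreg-pl-noun"),    # m o:i u:eps s:c e
    ("goose", "goose", "irreg-sg-noun"),
    ("mouse", "mouse", "irreg-sg-noun"),
]
# Small transducer: (word class, features) -> intermediate ending.
ENDINGS = {
    ("reg-noun", "+N +SG"): "#",
    ("reg-noun", "+N +PL"): "^s#",
    ("irreg-sg-noun", "+N +SG"): "#",
    ("irreg-pl-noun", "+N +PL"): "#",
}

def relation():
    """Compose the two: yield (underlying, intermediate) pairs."""
    for stem, spelled, word_class in STEMS:
        for (cls, features), ending in ENDINGS.items():
            if cls == word_class:
                yield f"{stem} {features}", spelled + ending

def generate(underlying):
    return sorted(i for u, i in relation() if u == underlying)

def parse(intermediate):
    return sorted(u for u, i in relation() if i == intermediate)

print(generate("fox +N +PL"), generate("goose +N +PL"))
print(parse("geese#"))
```

The same relation is used in both directions, which is the point of using a transducer rather than two separate procedures.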
Relating intermediate (morphotactic) representation to surface (spelling
rules).
The problem
- beg + ing = begging: Consonant doubling
- make + ing = making: e-deletion
- watch + s = watches: e-insertion
E-insertion: insert an e after a morpheme ending
in x, s, or z and before a word-final s.
Chomsky/Halle style "rewrite" rules:
E-insertion
eps => e / {x,s,z} ^ ____ s #
This can be modeled with an FST.
Interpreting the rule:
- ^ means a morpheme boundary.
- # means a word boundary.
- We assume word boundaries are "found" on the surface, but
morpheme boundaries are not (written language: English).
Thus every # transition is default (#:#), and every
^ transition is ^:eps.
- Other means, basically, not {s,x,z}, not ^, not #.
- This machine is designed to leave every word that the rule
does NOT apply to unchanged.
- State 0 is the irrelevant input state. We stay in
state 0 until we see something relevant. State 0
is a final state.
- State 1 means we found an {s,x,z}. State 2
means we've gone on to find a morpheme boundary.
- State 2 is the rule-ready state. We've seen
a morpheme end in {s,x,z}.
- The FST incorporates ONLY the information in our rewrite rule
(as written). It doesn't "know" anything else about English.
Some strange cases:
- The ^:eps transition from state 5 to state 2.
The rule requires e-insertion only before
word-final s. We reach state 5 having seen a
morpheme ending in {s,x,z} followed by a surface s
with no inserted e.
This is okay (by the rule) as long as that s
isn't followed by a word boundary (#). So in
state 5 we fail on a word boundary, but
a morpheme boundary (^) is okay. Seeing
a morpheme boundary, we go back to state
2 because that's the rule-ready state, appropriate
when we've seen a morpheme ending in s.
- The {z,s,x} transition from 5 to 1. This is exercise 3.10.
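A sketch of the E-insertion machine as a transition table, used here for generation (intermediate to surface). The states and the arcs discussed above follow the notes; the remaining arcs, and the choice of states 0, 1 and 2 as final, are assumptions filled in so that words the rule does not apply to pass through unchanged:

```python
# TABLE[state][label] -> next state. Labels: s, x, z, #, "other",
# "^:eps" (boundary deleted), "eps:e" (e inserted).
TABLE = {
    0: {"s": 1, "x": 1, "z": 1, "^:eps": 0, "#": 0, "other": 0},
    1: {"s": 1, "x": 1, "z": 1, "^:eps": 2, "#": 0, "other": 0},
    2: {"s": 5, "x": 1, "z": 1, "^:eps": 0, "eps:e": 3, "#": 0, "other": 0},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "x": 1, "z": 1, "^:eps": 2, "other": 0},
}
FINAL = {0, 1, 2}   # assumed final states

def generate(upper):
    """Return all surface strings the machine pairs with upper."""
    results = set()

    def walk(state, i, out):
        # eps:e consumes no input symbol and writes an e.
        if "eps:e" in TABLE[state]:
            walk(TABLE[state]["eps:e"], i, out + "e")
        if i == len(upper):
            if state in FINAL:
                results.add(out)
            return
        ch = upper[i]
        if ch == "^":
            label, written = "^:eps", ""
        elif ch in "sxz#":
            label, written = ch, ch
        else:
            label, written = "other", ch
        nxt = TABLE[state].get(label)
        if nxt is not None:
            walk(nxt, i + 1, out + written)

    walk(0, 0, "")
    return sorted(results)

for form in ["fox^s#", "cat^s#", "watch^s#"]:
    print(form, "->", generate(form))
```

The last example comes out as watchs#: the machine knows only the rule as written, which mentions x, s and z but not ch.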
Using the E-insertion rule when parsing:
- The correct parse
Underlying | | | | | | | |
State | | | | | | | |
Surface | a | s | s | e | s | s | # |
Underlying | a | s | s | e | s | s | # |
State | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
Surface | a | s | s | e | s | s | # |
- A false path
Underlying | a | s | s | ^ | eps | s | | |
State | 0 | 1 | 1 | 2 | 3 | 4 | fail | |
Surface | a | s | s | eps | e | s | s | # |
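The two tables can be reproduced by stepping through the machine one aligned symbol pair at a time. A sketch under the same assumptions as the description above: states 0 to 5 as in the notes, with arcs that the notes do not spell out filled in as assumptions, and eps written as the empty string:

```python
EPS = ""
TABLE = {
    0: {"s": 1, "x": 1, "z": 1, "^:eps": 0, "#": 0, "other": 0},
    1: {"s": 1, "x": 1, "z": 1, "^:eps": 2, "#": 0, "other": 0},
    2: {"s": 5, "x": 1, "z": 1, "^:eps": 0, "eps:e": 3, "#": 0, "other": 0},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "x": 1, "z": 1, "^:eps": 2, "other": 0},
}

def arc_label(upper, lower):
    """Map an underlying:surface pair to a label used in TABLE."""
    if (upper, lower) == ("^", EPS):
        return "^:eps"
    if (upper, lower) == (EPS, "e"):
        return "eps:e"
    if upper != lower or upper in (EPS, "^"):
        return None               # no such arc in this machine
    if upper in ("s", "x", "z", "#"):
        return upper
    return "other"

def trace(pairs):
    """Return the state reached after each pair, or 'fail'."""
    state, states = 0, []
    for upper, lower in pairs:
        nxt = TABLE[state].get(arc_label(upper, lower))
        if nxt is None:
            states.append("fail")
            return states
        state = nxt
        states.append(state)
    return states

correct = [(c, c) for c in "assess#"]
false_path = [("a", "a"), ("s", "s"), ("s", "s"), ("^", EPS),
              (EPS, "e"), ("s", "s"), ("s", "s"), ("#", "#")]
print(trace(correct))
print(trace(false_path))
```

The false path fails in state 4 because, after an inserted e and an s, the only arc available is the word boundary #, but the surface tape has another s.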