|
Morphemes
|
Words have meaningful sub-parts that aren't all words:
| unfamiliarity |
| un | familiar | ity |
|
|
| undeniability |
| un | deny | able | ity |
|
|
Morpheme
Structure
|
| Right |
| deny |
| deny | able |
| un | deny | able |
| un | deny | able | ity |
| undeniability |
|
| Wrong |
| deny |
| un | deny |
| un | deny | able |
| un | deny | able | ity |
| undeniability |
|
|
|
Stems
|
| walked |
| walk | ed |
| Stem | Suffix |
Stem may define a word that has multiple forms:
Stem may be more than one morpheme:
| formalized |
| formalize | ed |
| Stem | Suffix |
| form | al | ize | ed |
|
Morphological
Analyzer
Versus
Stemmer
|
Two kinds of programs:
- Morphological Analyzer: Not only takes a word apart into its
component "atoms" (morphemes), but also produces the right
"structure".
Right: formalize = [[form + al] + ize]
Wrong: formalize = [form + [al + ize]]
- Stemmer: Finds stems
formalize => formal
formalized => formalize
formalized => formal
|
English Verb Forms
(Stemming)
|
|   | Form | Stem |
| Present Participle | walking | walk |
| Past Participle | walked | walk |
| Past Form | walked | walk |
| Present Tense | walks | walk |
|
|   | Form | Stem |
|   | breaking | break |
|   | broke | break |
|   | broken | break |
|   | breaks | break |
|
|   | Form | Stem |
|   | bleeding | bleed |
|   | bled | bleed |
|   | bled | bleed |
|   | bleeds | bleed |
|
|
Inflectional
Morphology
|
Inflectional morphology is about the different forms of a
single word:
Inflectional Verb
| break |
-> | breaks | 3rd person present tense singular |
| breaking | present participle |
| broken | past participle |
| break | 3rd person present tense plural |
| broke | past tense |
Inflectional Noun
| cat |
-> | cat | Singular |
| cats | Plural |
Inflectional Adjective
| happy |
-> | happy | Positive |
| happier | Comparative |
| happiest | Superlative |
|
Derivational
Morphology
|
Derivational morphology is about making new
words.
| relational | -> | relate |
| conditional | -> | condition |
| digitizer | -> | digitize |
| predication | -> | predicate |
| hopefulness | -> | hopeful |
| feudalism | -> | feudal |
| formalize | -> | formal |
| allowance | -> | allow |
| effective | -> | effect |
|
Measure of
a Word
|
- Let vowels be:
A, E, I, O, U, and Y when it follows a consonant (a letter other than A, E, I, O, U)
try: "y" is a vowel
bay: "y" is not a vowel
- Let consonants be anything other than a vowel.
- Let V stand for any sequence of vowels.
- Let C stand for any sequence of consonants.
- We can represent any word by the pattern:
(C)(VC)^m(V)
The parentheses around C and V mean optionality. That is, words can skip
the initial consonant sequence C and start with a vowel
sequence ("at"), they can skip the VC sequences and
go directly to a final vowel ("tree"), or
they can skip the final vowel ("cat").
The superscript "m" means that (VC) is repeated m times. We allow any number
of VC sequences (each roughly a "syllable"), including zero.
- We call the value of "m" for a particular
word its measure.
Some examples of measure:
| m=0 | tr, ee, tree, by |
| m=1 | trouble, oats, trees, ivy |
| m=2 | troubles, private, oaten, orrery, biases |
| m=3 | intrusion, orreries |
|
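The definition of measure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `is_vowel` and `measure` are chosen here, and "y" is treated as a vowel only when the preceding letter is a consonant.

```python
def is_vowel(word, i):
    """True if word[i] is a vowel: a, e, i, o, u, or y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(word):
    """Return m, the number of VC sequences in the word's C/V pattern."""
    word = word.lower()
    pattern = "".join("V" if is_vowel(word, i) else "C" for i in range(len(word)))
    # Every switch from a vowel run to a consonant run appears as exactly one "VC".
    return pattern.count("VC")

for w in ["tree", "by", "trouble", "ivy", "oaten", "orrery", "intrusion"]:
    print(w, measure(w))
```

Counting "VC" on the letter-by-letter pattern gives the same result as first collapsing runs into C and V, because each vowel run followed by a consonant run contributes exactly one adjacent "VC" pair.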
Algorithm
Notation
|
| m | the measure of the stem |
| *S | the stem ends with S (and similarly for other letters) |
| *v* | the stem contains a vowel |
| *d | the stem ends with a double consonant (e.g. -TT or -SS) |
| *o | the stem ends in CVC (a consonant, a vowel, a consonant), where the second C is not W, X, or Y (e.g. -WIL, -HOP) |
|
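The stem conditions in this notation can each be written as a small predicate. The sketch below is one possible reading; the function names are hypothetical, and *o is interpreted as a single consonant, a single vowel, and a single consonant at the end of the stem.

```python
def is_vowel(word, i):
    """True if word[i] is a, e, i, o, u, or y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(is_vowel(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """*d: the stem ends with the same consonant twice (e.g. -tt, -ss)."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and not is_vowel(stem, n - 1)

def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, the last one not w, x, or y."""
    n = len(stem)
    if n < 3:
        return False
    return (not is_vowel(stem, n - 3)
            and is_vowel(stem, n - 2)
            and not is_vowel(stem, n - 1)
            and stem[-1] not in "wxy")
```

The check for *S needs no helper: `stem.endswith("s")` is enough.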
|
|
Stages of
Algorithm
|
- Plural Nouns and Third Person Singular verbs
(-s ending)
| cats | => | cat |
| walks | => | walk |
- Verbal Past tense and progressive forms
| walked | => | walk |
| walking | => | walk |
- Y => I
happy => happi
- Derivational Morphology I:
Multiple suffixes (-TIONAL, -ENCY, -ALLY, -IZATION,
-ALISM, etc.)
- Derivational Morphology II:
More multiple suffixes (-ICATE, -ATIVE, etc.)
- Derivational Morphology III:
Single suffixes (-AL, -ANCE, -ENCE, -IC, -ABLE)
- Cleanup
|
Stage I:
Third Person Verbs,
Plural Nouns
|
These rules have no conditions.
Only one rule per set may apply to a given word.
The rule with the LONGEST MATCH should apply.
This is what explains rule (c) below:
any word that matches (c) will not undergo (d).
|
 
|
|   | Rule | Example |
| (a) | SSES -> SS | caresses => caress |
| (b) | IES -> I | ponies => poni, ties => ti |
| (c) | SS -> SS | caress => caress |
| (d) | S -> eps | cats => cat |
|
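The longest-match behaviour of Stage I can be sketched as follows. The dictionary `STAGE1_RULES` and the function name `stage1` are names chosen for this illustration; the rules themselves are the four listed for this stage.

```python
# Suffix -> replacement; "eps" (the empty string) is written as "".
STAGE1_RULES = {"sses": "ss", "ies": "i", "ss": "ss", "s": ""}

def stage1(word):
    """Apply the single Stage I rule whose suffix is the longest match."""
    matches = [suffix for suffix in STAGE1_RULES if word.endswith(suffix)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    return word[: len(word) - len(suffix)] + STAGE1_RULES[suffix]

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "=>", stage1(w))
```

Rule (c) rewrites SS to itself, so it changes nothing; its only job is to win the longest-match contest against rule (d) and protect words like "caress" from losing their final S.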
|
Stage IIa
Past and
Progressive
Verbs
|
|
 
|
|   | Condition | Rule | Example |
| (a) | (m > 0) | EED -> EE | feed => feed, agreed => agree |
| (b) | (*v*) | ED -> eps | plastered => plaster, bled => bled |
| (c) | (*v*) | ING -> eps | trying => try, spring => spring |
|
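A minimal sketch of Stage IIa, with conditions tested on the stem that remains after the suffix is removed. The function names are hypothetical. The EED rule is checked with m > 0, following Porter's published definition; that is the condition under which "agreed" (stem "agr", m = 1) becomes "agree" while "feed" (stem "f", m = 0) is left alone.

```python
def is_vowel(word, i):
    """True if word[i] is a, e, i, o, u, or y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem):
    """Number of VC sequences in the stem."""
    pattern = "".join("V" if is_vowel(stem, i) else "C" for i in range(len(stem)))
    return pattern.count("VC")

def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(is_vowel(stem, i) for i in range(len(stem)))

def stage2a(word):
    """Strip -EED, -ED, or -ING when the condition on the stem holds."""
    if word.endswith("eed"):
        # Longest match: once EED matches, ED is not tried even if the condition fails.
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return stem if contains_vowel(stem) else word
    return word
```

For example, `stage2a("spring")` leaves the word alone because the would-be stem "spr" contains no vowel.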
|
Stage IIb
Cleanup
Consonant
Doubling,
E-insertion
|
|
 
|
|   | Condition | Rule | Example |
| (a) |   | AT -> ATE | conflat(ed) => conflate |
| (b) |   | BL -> BLE | troubl(ing) => trouble |
| (c) |   | IZ -> IZE | siz(ed) => size |
| (d) | (*d & !(*L or *S or *Z)) | CC -> C | hopp(ing) => hop, tann(ing) => tan, fall(ing) => fall, hiss(ing) => hiss, fizz(ing) => fizz |
| (e) | (m=1 & *o) | eps -> E | fail(ing) => fail, fil(ing) => file |
|
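The cleanup rules can be sketched as a function applied to the stem left over after -ED or -ING has been removed. This is an illustration under the assumption that the rules are tried in the order (a) to (e) and that at most one applies; the helper names are hypothetical.

```python
def is_vowel(word, i):
    """True if word[i] is a, e, i, o, u, or y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem):
    """Number of VC sequences in the stem."""
    pattern = "".join("V" if is_vowel(stem, i) else "C" for i in range(len(stem)))
    return pattern.count("VC")

def ends_double_consonant(stem):
    """*d: the stem ends with the same consonant twice."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and not is_vowel(stem, len(stem) - 1))

def ends_cvc(stem):
    """*o: consonant-vowel-consonant ending, last consonant not w, x, or y."""
    n = len(stem)
    return (n >= 3 and not is_vowel(stem, n - 3) and is_vowel(stem, n - 2)
            and not is_vowel(stem, n - 1) and stem[-1] not in "wxy")

def stage2b(stem):
    """Repair a stem after -ED / -ING removal (E-insertion, consonant undoubling)."""
    if stem.endswith(("at", "bl", "iz")):                       # rules (a)-(c)
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":   # rule (d)
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):                   # rule (e)
        return stem + "e"
    return stem
```

Note how "fail" and "fil" differ only in rule (e): "fil" ends consonant-vowel-consonant, while the last three letters of "fail" start with a vowel.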
|
Stage III:
Y => I
|
|
| (*v*) | Y -> I | happy => happi, sky => sky |
|
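Stage III is a single conditional rewrite. In the sketch below the *v* condition is tested on the part of the word before the final Y; the function names are chosen for this illustration.

```python
def is_vowel(word, i):
    """True if word[i] is a, e, i, o, u, or y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def stage3(word):
    """Turn a final y into i when the rest of the word contains a vowel."""
    if word.endswith("y"):
        stem = word[:-1]
        if any(is_vowel(stem, i) for i in range(len(stem))):
            return stem + "i"
    return word

for w in ["happy", "sky"]:
    print(w, "=>", stage3(w))
```

The rewrite to "i" is what lets later suffix rules be stated with endings such as -ALLI and -ALITI.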
|
Stage IV:
Suffixes
|
|
| Condition | Rule | Examples |
| (m > 0) | ATIONAL => ATE | relational => relate |
| (m > 0) | ENCI => ENCE | valenci => valence |
| (m > 0) | IZER => IZE | digitizer => digitize |
| (m > 0) | ABLI => ABLE | comfortabli => comfortable |
| (m > 0) | ALLI => AL | radicalli => radical |
| (m > 0) | ENTLI => ENT | differentli => different |
| (m > 0) | IZATION => IZE | vietnamization => vietnamize |
| (m > 0) | ELI => E | vileli => vile |
| (m > 0) | OUSLI => OUS | analogousli => analogous |
| (m > 0) | ATION => ATE | predication => predicate |
| (m > 0) | ATOR => ATE | operator => operate |
| (m > 0) | ALISM => AL | feudalism => feudal |
| (m > 0) | IVENESS => IVE | decisiveness => decisive |
| (m > 0) | FULNESS => FUL | hopefulness => hopeful |
| (m > 0) | OUSNESS => OUS | callousness => callous |
| (m > 0) | ALITI => AL | formaliti => formal |
| (m > 0) | IVITI => IVE | sensitiviti => sensitive |
| (m > 0) | BILITI => BLE | sensibiliti => sensible |
|
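Because every Stage IV rule has the same condition, the stage fits naturally into a lookup table. The following is a sketch under that assumption: `STAGE4_RULES` and `stage4` are names chosen here, the longest matching suffix is selected, and the measure is computed on the stem that remains after the suffix is removed.

```python
def is_vowel(word, i):
    """True if word[i] is a, e, i, o, u, or y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem):
    """Number of VC sequences in the stem."""
    pattern = "".join("V" if is_vowel(stem, i) else "C" for i in range(len(stem)))
    return pattern.count("VC")

STAGE4_RULES = {
    "ational": "ate", "enci": "ence", "izer": "ize", "abli": "able",
    "alli": "al", "entli": "ent", "ization": "ize", "eli": "e",
    "ousli": "ous", "ation": "ate", "ator": "ate", "alism": "al",
    "iveness": "ive", "fulness": "ful", "ousness": "ous",
    "aliti": "al", "iviti": "ive", "biliti": "ble",
}

def stage4(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    matches = [suffix for suffix in STAGE4_RULES if word.endswith(suffix)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    stem = word[: len(word) - len(suffix)]
    return stem + STAGE4_RULES[suffix] if measure(stem) > 0 else word

for w in ["relational", "vietnamization", "predication", "feudalism"]:
    print(w, "=>", stage4(w))
```

Longest match matters here too: "vietnamization" ends in both -IZATION and -ATION, and only the longer rule gives "vietnamize".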
Stage V:
Suffixes
|
|
| Condition | Rule | Examples |
| (m > 0) | ICATE => IC | triplicate => triplic |
| (m > 0) | ATIVE => eps | formative => form |
| (m > 0) | ALIZE => AL | formalize => formal |
| (m > 0) | ICITI => IC | electriciti => electric |
| (m > 0) | FUL => eps | hopeful => hope |
| (m > 0) | NESS => eps | goodness => good |
|