Linguistics 596 (to be Linguistics 571)

Porter Stemming Algorithm

Motivations and Background

Morphemes

Words have meaningful sub-parts that aren't all words:

    unfamiliarity
    un familiar ity
    formalize
    form al ize
    undeniability
    un deny able ity
Morpheme
Structure

    Right Wrong
    deny
    deny able
    un deny able
    un deny able ity
    undeniability
    deny
    un deny
    un deny able
    un deny able ity
    undeniability
Stems
    walked
    walk ed
    Stem Suffix
Stem may define a word that has multiple forms:
    walked
    walking
    walks
Stem may be more than one morpheme:
    formalized
    formal ed
    Stem Suffix
    form al ed
Morphological
Analyzer
Versus
Stemmer

Two kinds of programs:

  1. Morphological Analyzer: Not only takes a word apart into its component "atoms" (morphemes), but also produces the right "structure".
      Right: formalize = [[form + al] + ize]
      Wrong: formalize = [form + [al + ize]]
  2. Stemmer: Finds stems
      formalize => formal
      formalized => formalize
      formalized => formal
English Verb Forms
(Stemming)

      Form Stem
    Present
    Participle
    walking walk
    Past
    Participle
    walked walk
    Past Formwalked walk
    Present
    Tense
    walks walk
      Form Stem
    breaking
     
    break
    broke
     
    break
    broken
     
    break
    breaks
     
    break
      Form Stem
    bleeding
     
    bleed
    bled
     
    bleed
    bled
     
    bleed
    bleeds
     
    bleed
Inflectional
Morphology

Inflectional morphology is about the different forms of a single word:

    Inflectional Verb
    break -> breaks 3rd person present tense singular
    breaking present participle
    broken present participle
    break 3rd person present tense plural
    broke past tense
    Inflectional Noun
    cat -> cat Singular
    cats Singular
    Inflectional Adjective
    happy -> happy Positive
    happier Comparative
    happiest Superlative
Derivational
Morphology

Dervivational morphology is about making new words.

    relational -> relate
    conditional -> condition
    digitizer -> digitize
    predication -> predicate
    hopefulness -> hopeful
    feudalism -> feudal
    formalize -> formal
    feudalism -> feudal
    allowance -> allow
    effective -> effect
Measure of
a Word

  1. Let vowels be:
      A E I O U and Y (after non- A,E,I,O,U)
      try: "y" is a vowel
      bay: "y" is not a vowel
  2. Let consonants be anything other than a vowel.
  3. Let V stand for any sequence of vowels.
  4. Let C stand for any sequence of consonants.
  5. We can represent any word by the pattern:
      (C)(VC)m(V)
    Parentheses here mean optionality. That is words can skip the initial consonant sequence C and start witha vowel sequence("at"), they can skip the VC sequence and go directly to a final vowel ("tree"), or they skip the final vowel ("cat").

    Here "m" is Kleene-operator m. We allow any number of VC sequences ( each a "syllable").

  6. WE call the value of "m" for a particular word its measure.
Some of examples of measure:
m=0 tr, ee,tree,by
m=1 trouble,oats,trees,ivy
m=2 troubles,private,oaten,orrery,biases
m=3 intrusion,orreries
Algorithm
Notation

m the measure of the stem
*S the stem ends with S (and similarly for other letters)
*v* The stem contains a vowel
*d the stem ends with a double consonant (e.g. -TT or -SS)
*o the stem ends in CV1C, where the second C is not W, X, or Y (e.g. -WIL, -HOP)

Stages of Algorithm

  1. Plural Nouns and Third Person Singular verbs (-s ending)
    cats => cat
    walks => walk
  2. Verbal Past tense and progressive forms
    walked => walk
    walking => walk
  3. Y => I
      happy => happi
  4. Derivational Morphology I: Multiple suffixes (-TIONAL, -ENCY, -ALLY, -IZATION, -ALISM.etc.)
  5. Derivational Morphology II: More Multiple suffixes (-ICATE, -ATIVE, .etc.)
  6. Derivational Morphology III: Single suffixes (-AL,-ANCE,-ENCE, -IC,-ABLE)
  7. Cleanup

The Algorithm

Stage I:
Third Present Verbs,
Plural Noun

No conditions for these rules: Only one rule per set may apply to a given word. The rule with the LONGEST MATCH should apply. This is what explains rule (c) below. Any word that matches (c) will not undergo (d).

      Rule Example
    (a) SSES->SS caresses => carress
    (b) IES->I ponies=> poni
    ties=> ti
    (c) SS->SS ponies=> poni
    ties=> ti
    (d) S->eps cats=> cat
Stage IIa
Past and
Progressive
Verbs

      Condition Rule Example
    (a) (m > 1) EED->EE feed => feed
    agreed => agree
    (b) (*v*) ED->eps plastered=> plaster
    bled => bled
    (c) (*v*) ING->eps try=> try
    spring => spring
Stage IIb
Cleanup
Consonant
Doubling,
E-insertion

      Condition Rule Example
    (a)   AT => ATE conflat(ed) => conflate
    (b)   BL->BLE troubl(ing) => trouble
    (c)   IZ->IZE siz(ed) => size
    (d) (*d & !(*L or *S or *Z)) CC -> C hopp(ing) => hop
    tann(ing) => tan
    fall(ing) => fall
    hiss(ing) => hiss
    fizz(ing) => fizz
    (e) (m=1 & *o) eps -> e fail(ing) => fail
    fil(ing) => file
Stage III:
Y => I

    (*v*) Y -> I happy => happi
    sky => sky
Stage IV:
Suffixes

    Condition Rule Examples
    (m > 0) ATIONAL => ATE relational => relate
    (m > 0) ENCI => ENCE valenci => valence
    (m > 0) IZER => IZE digitizer => digitize
    (m > 0) ABLI => ABLE comfortably => comfortable
    (m > 0) ALLI => AL radicalli => radical
    (m > 0) ENTLI => ENT differentli => different
    (m > 0) IZATION => IZE vietnamization => vietnamize
    (m > 0) ELI => E vileli => vile
    (m > 0) OUSLI => OUS analogousli => analogous
    (m > 0) ATION => ATE predication => predicate
    (m > 0) ATOR => ATE operator => operate
    (m > 0) ALISM => AL feudalism => feudal
    (m > 0) IVENESS => IVE decisiveness => decisive
    (m > 0) FULNESS => FUL hopefulness => hopeful
    (m > 0) OUSNESS => OUS callousness => callous
    (m > 0) ALITI => AL formaliti => formal
    (m > 0) IVITI => IVE sensitiviti => sensitive
    (m > 0) BILITI => BLE sensibility => sensible
Stage V:
Suffixes

    Condition Rule Examples
    (m > 0) ICATE => IC triplicate => triplic
    (m > 0) ATIVE => eps formative => form
    (m > 0) ALIZE => AL formalize => form
    (m > 0) ICITI => IC electriciti => electric
    (m > 0) FUL => eps hopeful => hope
    (m > 0) NESS => eps goodness => good
Stage VI:
Suffixes

    Condition Rule Examples
    (m > 0) AL => ep revival => reviv
    (m > 0) ANCE => eps allowance => allow
    (m > 0) ENCE => eps inference => infer
    (m > 0) ER => eps airliner => airlin
    (m > 0) IC => eps gyroscopic => gyroscop
    (m > 0) ABLE => eps doable => do
    (m > 0) ANT => eps irritant => irrit
    (m > 0) EMENT => eps replacement => replac
    (m > 0) MENT => eps adjustment => adjust
    (m > 0) ENT => eps dependent => depend
    (m > 0) (*S or *T)&ION => eps adoption => adopt
    emission => emi [???]
    (m > 0) OU => eps homologou => homolog
    (m > 0) ISMR => eps communism => commun
    (m > 0) ATE => eps activate => aactiv
    (m > 0) ITI => eps angularity => angular
    (m > 0) OUS => eps homologous => homolog
    (m > 0) IVE => eps effective => effect
    (m > 0) IZE => eps bowdlerize => bowdler
    realize => real[??]
Stage VIIa
Cleanup

      Condition Rule Example
    (a) (m > 1) E -> eps probate -> probat
    rate -> rate
    (b) (m=1 & !*o) E -> eps cease => ceas
Stage VIIb
Cleanup

    Condition Rule Example
    (m > 1 & *d *L) CC -> C controll -> control
    roll -> roll