Introduction to Computational Linguistics

Top Down Parsing

Think of this as an algorithm that takes one particular view of how to do a derivation. Suppose we are trying to parse the sentence John likes Richard M. Nixon with respect to the grammar:

  • S -> NP VP | VP
  • VP -> V NP | V PP
  • NP -> Art (Adj) N | PN
  • VP -> is/are Adj
  • PP -> P NP
  • PN -> Richard M Nixon | John
  • V -> likes | runs | eats
  • P -> in | on | under
  • Art -> the | a
  • Adj -> red | big
  • N -> house | garden | car
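
For concreteness, here is one way this grammar might be encoded for the Python recognizer given at the end of these notes. This is a sketch under assumptions: the exact representation used in the full-blown implementation is not shown here; the recognizer only needs each non-terminal to map to a list of right-hand sides (the optional Adj is written out as two separate NP rules).

    # Hypothetical encoding of the grammar above: each non-terminal maps to a
    # list of right-hand sides, each right-hand side a list of symbols.
    productions = {
        'S':   [['NP', 'VP'], ['VP']],
        'VP':  [['V', 'NP'], ['V', 'PP'], ['is', 'Adj'], ['are', 'Adj']],
        'NP':  [['Art', 'N'], ['Art', 'Adj', 'N'], ['PN']],
        'PP':  [['P', 'NP']],
        'PN':  [['Richard', 'M.', 'Nixon'], ['John']],
        'V':   [['likes'], ['runs'], ['eats']],
        'P':   [['in'], ['on'], ['under']],
        'Art': [['the'], ['a']],
        'Adj': [['red'], ['big']],
        'N':   [['house'], ['garden'], ['car']],
    }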

Non-terminals are symbols that occur on the LHS of a rule. The non-terminals in this grammar are:

    S, NP, VP, PP, P, Art, Adj, N, V, PN
Terminals are symbols that occur ONLY on the RHS of a rule. The terminals in this grammar are:
    is are Richard M Nixon John likes runs
    eats in on under the a red big
The pre-terminals are symbols that occur on the LHS of a rule introducing ONLY terminals. The pre-terminals are:
    N Art Adj PN V P
These correspond to our traditional notion of parts of speech in a natural language grammar.
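
Given a representation like the productions dict sketched above, these three symbol classes can be computed mechanically. A minimal sketch (the variable names here are mine, not part of the Parser class):

    # Non-terminals: everything that appears on a LHS.
    nonterminals = set(productions)
    # Terminals: symbols that appear on some RHS but never on a LHS.
    rhs_symbols = {sym for rhss in productions.values() for rhs in rhss for sym in rhs}
    terminals = rhs_symbols - nonterminals
    # Pre-terminals: LHS symbols whose rules introduce only terminals.
    preterminals = {lhs for lhs, rhss in productions.items()
                    if all(sym in terminals for rhs in rhss for sym in rhs)}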

The START Symbol in the grammar is S. Then a derivation that yields the sentence is:

    Rule                        Application
    START                       S
    S -> NP VP                  NP VP
    NP -> PN                    PN VP
    PN -> John                  John VP
    VP -> V NP                  John V NP
    V -> likes                  John likes NP
    NP -> PN                    John likes PN
    PN -> Richard M. Nixon      John likes Richard M. Nixon

Derivations  

Notice there are many derivations that could produce the sentence from this grammar.

    Rule                        Application
    START                       S
    S -> NP VP                  NP VP
    NP -> PN                    PN VP
    VP -> V NP                  PN V NP
    NP -> PN                    PN V PN
    PN -> John                  John V PN
    V -> likes                  John likes PN
    PN -> Richard M. Nixon      John likes Richard M. Nixon

Both derivations start with the start symbol and begin a sequence of re-writes, terminating with a line that consists ONLY of terminals. Therefore these are top-down derivations.

The first derivation picked a particular symbol and kept re-writing until the result was all terminals, then picked the next symbol, and so on, moving left to right. This is a top-down left-to-right depth first derivation.

The second derivation just rewrote symbols in order of occurrence from left to right. This is a top-down left-to-right breadth first derivation.
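
As an illustration (this helper is mine, not part of the parser code discussed later), a derivation can be replayed mechanically: each step rewrites one occurrence of a rule's left-hand side. Rewriting the leftmost occurrence of the given left-hand side is enough to replay both derivations above; here is the depth-first one:

    def rewrite_leftmost(form, lhs, rhs):
        """Replace the leftmost occurrence of lhs in the symbol list form with rhs."""
        i = form.index(lhs)                       # raises ValueError if lhs is absent
        return form[:i] + rhs + form[i + 1:]

    form = ['S']
    for lhs, rhs in [('S', ['NP', 'VP']), ('NP', ['PN']), ('PN', ['John']),
                     ('VP', ['V', 'NP']), ('V', ['likes']), ('NP', ['PN']),
                     ('PN', ['Richard', 'M.', 'Nixon'])]:
        form = rewrite_leftmost(form, lhs, rhs)
        print(' '.join(form))                     # last line: John likes Richard M. Nixon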

Bottom-Up Derivation

Derivations do not have to start with the start symbol in order to fulfill the role we want of them (justifying that a particular string is licensed by a particular grammar).

We can also start with the string and work our way to the start symbol.

    Rule                        Application
    PN -> Richard M. Nixon      John likes Richard M. Nixon
    NP -> PN                    John likes PN
    V -> likes                  John likes NP
    VP -> V NP                  John V NP
    PN -> John                  John VP
    NP -> PN                    PN VP
    S -> NP VP                  NP VP
    START                       S

Notice this is just exactly the reverse of the first top-down depth-first derivation.

Trees  

A derivation can be associated with a tree in a fairly obvious way.

All three of the derivations we just looked at produced the same tree:
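
In the bracketed notation used in the parsing trace below, that tree is:

    (S (NP (PN John))
       (VP (V likes)
           (NP (PN Richard M. Nixon))))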

It turns out we don't usually care what order the steps in a derivation were found. What we're usually after is just the groupings of phrases they describe, because this is what is connected to meaning. Thus what we want is the tree not the derivation.

This is true in parsing computer languages as well.

Derivations are still very useful though because computing a tree will always mean finding a particular derivation (the steps have to be done in some order!). So they help us classify parsing algorithms.

Parsing  

We can now define what parsers and recognizers are.

Recognizers determine whether a particular string is licensed by a particular grammar. They give a yes/no answer. They do this by finding a derivation.

Parsers find the tree or trees for a string licensed by the grammar. They also do this by finding a derivation.

So the two kinds of algorithms are very closely connected.

Parser states: Recognition

We pair a list of categories we are trying to find with the span of input the search starts at:

    ((S) (John likes Richard M. Nixon))

A parser state is thus the current goal of the computation. It consists of two pieces of information:

  1. What we're looking for: A sequence of categories
  2. Where we're looking for it: the string we are trying to assign that analysis to (how far we've gotten in finding a derivation!)
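
The Python recognizer at the end of these notes packages these two pieces of information into ParserState objects. The class itself is not reproduced there, so the following is only a minimal sketch consistent with how it is used (built from a goal list and a word list, read back via .GoalList and .WordList):

    class ParserState:
        """Sketch (assumption): a parser state pairs goals with remaining input."""
        def __init__(self, GoalList, WordList):
            self.GoalList = GoalList    # 1. what we're looking for: a sequence of categories
            self.WordList = WordList    # 2. where we're looking for it: the remaining input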

For top-down parsing, we always start in the following state:

  1. S [or whatever "start" symbol of grammar is]
  2. the entire input string
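
In the Python recognizer given at the end of these notes, this initial state is exactly the first recursive call that recognize_string makes: the goal list holds just the start category, the agenda is empty, and the word list is the whole input.

    self.recurse_recognize([self.start_cat], [], self.input)   # goals = [S], empty agenda, whole input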

Recall the simple grammar given above.

Recognition

Here are the steps of the recognition computation for John likes R.M.N.:

    Step 1
        Current state:  ((S) (John likes R.M.N.))
        State stack:    (empty)
        Rewrite/Found:  initial state

    Step 2
        Current state:  ((NP VP) (John likes R.M.N.))
        State stack:    ((VP) (John likes R.M.N.))
        Rewrite/Found:  S -> NP VP (current state); S -> VP (state stack)

    Step 3
        Current state:  ((Art N VP) (John likes R.M.N.))
        State stack:    ((Art Adj N VP) (John likes R.M.N.))
                        ((PN VP) (John likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  NP -> Art N (current state); NP -> Art Adj N (state stack); NP -> PN (state stack)

    Step 4
        Current state:  ((Art Adj N VP) (John likes R.M.N.))
        State stack:    ((PN VP) (John likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Popped state stack!! "John" is not an Art

    Step 5
        Current state:  ((PN VP) (John likes R.M.N.))
        State stack:    ((VP) (John likes R.M.N.))
        Rewrite/Found:  Popped state stack!! "John" is not an Art

    Step 6
        Current state:  ((VP) (likes R.M.N.))
        State stack:    ((VP) (John likes R.M.N.))
        Rewrite/Found:  Lexical pop! "John" is a PN

    Step 7
        Current state:  ((V NP) (likes R.M.N.))
        State stack:    ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  VP -> V NP; VP -> V PP

    Step 8
        Current state:  ((NP) (R.M.N.))
        State stack:    ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Lexical pop! "likes" is a V

    Step 9
        Current state:  ((Art N) (R.M.N.))
        State stack:    ((Art Adj N) (R.M.N.))
                        ((PN) (R.M.N.))
                        ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  NP -> Art N (current state); NP -> Art Adj N (backup state); NP -> PN (backup state)

    Step 10
        Current state:  ((Art Adj N) (R.M.N.))
        State stack:    ((PN) (R.M.N.))
                        ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Popped state stack!! "Richard" is not an Art

    Step 11
        Current state:  ((PN) (R.M.N.))
        State stack:    ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Popped state stack!! "Richard" is not an Art

    Step 12
        Current state:  (() ())
        State stack:    ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Lexical pop! "R.M.N." is a PN. Done!!
Parser states: Parsing

A parser state is now a triple: node list, input, tree:

    ((S) (John likes Richard M. Nixon) •)

We start in the following state:

  1. S [or whatever "start" symbol of grammar is]
  2. the entire input string
  3. • [the part of the tree we are currently building, start building top node]
Parsing

Here are the steps of the parsing computation for John likes R.M.N.:

    Step 1
        Current state:  ((S) (John likes R.M.N.) •)
        State stack:    (empty)

    Step 2
        Current state:  ((NP VP) (John likes R.M.N.) (S •))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 3
        Current state:  ((Art N VP) (John likes R.M.N.) (S (NP •)))
        State stack:    ((Art Adj N VP) (John likes R.M.N.) (S (NP •)))
                        ((PN VP) (John likes R.M.N.) (S (NP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 4
        Current state:  ((Art Adj N VP) (John likes R.M.N.) (S (NP •)))
        State stack:    ((PN VP) (John likes R.M.N.) (S (NP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 5
        Current state:  ((PN VP) (John likes R.M.N.) (S (NP •)))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 6
        Current state:  ((VP) (likes R.M.N.) (S (NP (PN John)) •))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 7
        Current state:  ((V NP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
        State stack:    ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 8
        Current state:  ((NP) (R.M.N.) (S (NP (PN John)) (VP (V likes) •)))
        State stack:    ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 9
        Current state:  ((Art N) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
        State stack:    ((Art Adj N) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
                        ((PN) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
                        ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 10
        Current state:  ((Art Adj N) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
        State stack:    ((PN) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
                        ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 11
        Current state:  ((PN) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
        State stack:    ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 12
        Current state:  (() () (S (NP (PN John)) (VP (V likes) (NP (PN R.M.N.)))))
        State stack:    ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))
        A complete parse has been found; the search then continues with the states left on the stack.

    Step 13
        Current state:  ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 14
        Current state:  ((PP) (R.M.N.) (S (NP (PN John)) (VP (V likes) •)))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 15
        Current state:  ((P NP) (R.M.N.) (S (NP (PN John)) (VP (V likes) (PP •))))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 16
        Current state:  ((VP) (John likes R.M.N.) (S •))
        State stack:    (empty)

    Step 17
        Current state:  ((V NP) (John likes R.M.N.) (S (VP •)))
        State stack:    ((V PP) (John likes R.M.N.) (S (VP •)))

    Step 18
        Current state:  ((V PP) (John likes R.M.N.) (S (VP •)))
        State stack:    (empty)

    Step 19
        "John" is not a V and the state stack is empty, so the search halts. The only complete parse is the tree built at step 12.

Pseudocode Version

      Pseudocode for a top-down parser.
    Note: This version has lots of errors. Please check the Text errata for corrections.

    Pseudocode (my version, disagrees even with the corrected version):

      function TOP-DOWN-PARSE (Input, Grammar) returns a parse tree
        agenda <- (Initial S tree, Beginning of input)
        current-search-state <- POP(agenda)
        loop
          if SUCCESSFUL-PARSE?(current-search-state) then
             return TREE(current-search-state)
          else
             if CAT-OF-NODE(current-search-state) is a POS then
                      if  CAT-OF-NODE(current-search-state) 
                          in POS(CURRENT-INPUT(current-search-state)) 
                      then
                       PUSH(LEXICAL-POP(current-search-state), agenda)
                      else
                         current-search-state <- POP(agenda)
             else
                 expansions <- EXPANSIONS(CAT-OF-NODE(current-search-state))
                 current-expansion <- POP(expansions)
                 current-search-state <- 
                     EXPAND-CURRENT-NODE(current-search-state, 
                                        current-expansion)
                 agenda <- APPEND(expansions, agenda)
        end
      
Python Version

    Assume a Parser class. Full-blown code is given here.

    In the following code many of the function definitions begin with long multi-line strings. These are Python docstrings (self-documentation strings). You can look at the documentation of ANY Python method or function by looking at its __doc__ attribute. For example, assume the definition of recognize_string below, and assume it is a method defined for some class Parser, and that "p1" is an instance of that class. Then I could look at the documentation of recognize_string as follows:

    >>> print p1.recognize_string.__doc__
    
            {string}: A string.  The string to be parsed by current goals.
                    Immediately converted to a [WordList] by self.legal_input.
                    
            : The result of calling recurse_recognize on the start-cat,
            the empty agenda, and the wordlist generated from string.
    

    Now let's look at the code of a simple top-down recognizer (note: the code below differs in minor details from the code in the full-blown implementation, mostly in calls to trace functions).

        def recognize_string(self, string=''):
            """
            {string}: A string.  The string to be parsed.
                      Immediately converted to a [WordList] by self.legal_input.
                    
            [Return]: The result of calling recurse_recognize on the start-cat,
            the empty agenda, and the wordlist generated from string.
            """
            if self.legal_input(string):
                return self.recurse_recognize([self.start_cat],[],self.input)
            else:
                return False
    
        def recurse_recognize(self,goals,agenda,wordlist):
            """
            {goals}: [GoalList]: A list of categories and/or words proposed to cover
                                 wordlist
            {agenda}: A list of [ParserState]s: each a pair of ([GoalList],[WordList])
            {wordlist}: a list of words.
            [Return]: True iff {goals} and {input} are empty
    
            Suppose {goals} is non-empty:
            
            If {goals}[0] is a nonterminal, call expand (expand it with the grammar
            and continue parsing recursively).  If {goals}[0] is a terminal
            then call_match_word (try to match the next word and continue parsing).
            Both expand and match_word contain recursive calls to recurse_recognize.
    
            Suppose {goals} is empty:
    
            Then if {input} is empty, [Return] True. Else backtrack (If {agenda} is non-empty
            continue parsing with the next state on the agenda; else [Return] False)
            backtrack also contains a recursive call to recurse_recognize.
            """
            if len(goals) > 0:
                next_goal = goals[0]
                if not next_goal in self.terminals:
                    return self.expand(next_goal,goals[1:],agenda,wordlist)
                else:
                    return self.match_word(next_goal,wordlist,goals[1:],agenda)
            elif len(wordlist) == 0:
                return True
            else:
                return self.backtrack(agenda)
    
        def expand(self,next_goal,rest_goals,agenda,wordlist):
            """
            Expand next_goal using grammar (self.productions)
            Generate new GoalList using one production and add unused productions
            to the agenda. Keep parsing with new GoalList, new agenda, old WordList.
            """
            productions = self.productions[next_goal]
            next_production = productions[0]
            rest_productions = productions[1:]
            for p in rest_productions:
                agenda = [ParserState(p+rest_goals,wordlist)] + agenda
            return self.recurse_recognize(next_production+rest_goals, agenda, wordlist)
    
        def match_word(self,next_goal,wordlist,rest_goals,agenda):
            """
            {next_goal} is a terminal.
    
            Suppose {wordlist} is non-empty:
            
            If next_goal matches {wordlist}[0],  keep parsing with {rest_goals}, {agenda}
            and {wordlist}[1:]; else this parse path fails;  call backtrack (if {agenda}
            is non-empty, continue parsing with the next parser state
            on {agenda}, and if {agenda} is empty, [Return]: False)
    
            Suppose wordlist is empty
    
            This parse path fails.  Call backtrack
            """
            if len(wordlist) > 0:
                if next_goal == wordlist[0]:
                    ## We just matched a word! Discard and keep parsing rest of wordlist
                    return self.recurse_recognize(rest_goals,agenda,wordlist[1:])
                else:
                    return self.backtrack(agenda)
            else:
                return self.backtrack(agenda)
            
    
        def backtrack(self, agenda):
            """
            Suppose {agenda} is non-empty:
    
            Then
              agenda[0].GoalList = new goals
              agenda[0].WordList = new wordlist
            Keep parsing with new goals and new worldlist and popped agenda.
    
            Suppose {agenda} is empty
    
            return False
            """
            if len(agenda) > 0:
                return self.recurse_recognize(agenda[0].GoalList,agenda[1:],agenda[0].WordList)
            else:
                return False
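
    Finally, a usage sketch. The constructor call below is an assumption (the full-blown implementation may set up the grammar and start category differently); only the behavior of recognize_string follows from the code above:

        # Hypothetical setup -- the constructor signature is assumed, not taken
        # from the full-blown implementation.
        p1 = Parser(productions, start_cat='S')
        print(p1.recognize_string('John likes Richard M. Nixon'))   # expected: True
        print(p1.recognize_string('John likes the red'))            # expected: False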