Introduction to Computational Linguistics

Top Down Parsing

Think of this as an algorithm that takes one particular view of how to do a derivation. Suppose we are trying to parse the sentence John likes Richard M. Nixon with respect to the grammar:

  • S -> NP VP | VP
  • VP -> V NP | V PP
  • NP -> Art (Adj) N | PN
  • VP -> is/are Adj
  • PP -> P NP
  • PN -> Richard M Nixon | John
  • V -> likes | runs | eats
  • P -> in | on | under
  • Art -> the | a
  • Adj -> red | big
  • N -> house | garden | car
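
For concreteness, here is one way this grammar might be encoded for the Python recognizer given at the end of these notes. This is a sketch under assumptions: the exact representation used in the full-blown implementation is not shown here; the recognizer only needs each non-terminal to map to a list of right-hand sides (the optional Adj is written out as two separate NP rules).

    # Hypothetical encoding of the grammar above: each non-terminal maps to a
    # list of right-hand sides, each right-hand side a list of symbols.
    productions = {
        'S':   [['NP', 'VP'], ['VP']],
        'VP':  [['V', 'NP'], ['V', 'PP'], ['is', 'Adj'], ['are', 'Adj']],
        'NP':  [['Art', 'N'], ['Art', 'Adj', 'N'], ['PN']],
        'PP':  [['P', 'NP']],
        'PN':  [['Richard', 'M.', 'Nixon'], ['John']],
        'V':   [['likes'], ['runs'], ['eats']],
        'P':   [['in'], ['on'], ['under']],
        'Art': [['the'], ['a']],
        'Adj': [['red'], ['big']],
        'N':   [['house'], ['garden'], ['car']],
    }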

Non-terminals are symbols that occur on the LHS of a rule. The non-terminals in this grammar are:

    S, NP, VP, PP, P, Art, Adj, N, V, PN
Terminals are symbols that occur ONLY on the RHS of a rule. The terminals in this grammar are:
    is are Richard M Nixon John likes runs
    eats in on under the a red big
The pre-terminals are symbols that occur on the LHS of a rule introducing ONLY terminals. The pre-terminals are:
    N Art Adj PN V P
These correspond to our traditional notion of parts of speech in a natural language grammar.
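
Given a representation like the productions dict sketched above, these three symbol classes can be computed mechanically. A minimal sketch (the variable names here are mine, not part of the Parser class):

    # Non-terminals: everything that appears on a LHS.
    nonterminals = set(productions)
    # Terminals: symbols that appear on some RHS but never on a LHS.
    rhs_symbols = {sym for rhss in productions.values() for rhs in rhss for sym in rhs}
    terminals = rhs_symbols - nonterminals
    # Pre-terminals: LHS symbols whose rules introduce only terminals.
    preterminals = {lhs for lhs, rhss in productions.items()
                    if all(sym in terminals for rhs in rhss for sym in rhs)}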

The START Symbol in the grammar is S. Then a derivation that yields the sentence is:

    Rule                        Application
    START                       S
    S -> NP VP                  NP VP
    NP -> PN                    PN VP
    PN -> John                  John VP
    VP -> V NP                  John V NP
    V -> likes                  John likes NP
    NP -> PN                    John likes PN
    PN -> Richard M. Nixon      John likes Richard M. Nixon

Derivations  

Notice there are many derivations that could produce the sentence from this grammar.

    Rule                        Application
    START                       S
    S -> NP VP                  NP VP
    NP -> PN                    PN VP
    VP -> V NP                  PN V NP
    NP -> PN                    PN V PN
    PN -> John                  John V PN
    V -> likes                  John likes PN
    PN -> Richard M. Nixon      John likes Richard M. Nixon

Both derivations start with the start symbol and begin a sequence of re-writes, terminating with a line that consists ONLY of terminals. Therefore these are top-down derivations.

The first derivation picked a particular symbol and kept re-writing until the result was all terminals, then picked the next symbol, and so on, moving left to right. This is a top-down left-to-right depth first derivation.

The second derivation just rewrote symbols in order of occurrence from left to right. This is a top-down left-to-right breadth first derivation.
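
As an illustration (this helper is mine, not part of the parser code discussed later), a derivation can be replayed mechanically: each step rewrites one occurrence of a rule's left-hand side. Rewriting the leftmost occurrence of the given left-hand side is enough to replay both derivations above; here is the depth-first one:

    def rewrite_leftmost(form, lhs, rhs):
        """Replace the leftmost occurrence of lhs in the symbol list form with rhs."""
        i = form.index(lhs)                       # raises ValueError if lhs is absent
        return form[:i] + rhs + form[i + 1:]

    form = ['S']
    for lhs, rhs in [('S', ['NP', 'VP']), ('NP', ['PN']), ('PN', ['John']),
                     ('VP', ['V', 'NP']), ('V', ['likes']), ('NP', ['PN']),
                     ('PN', ['Richard', 'M.', 'Nixon'])]:
        form = rewrite_leftmost(form, lhs, rhs)
        print(' '.join(form))                     # last line: John likes Richard M. Nixon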

Bottom-Up Derivation

Derivations do not have to start with the start symbol in order to fulfill the role we want of them (justifying that a particular string is licensed by a particular grammar).

We can also start with the string and work our way to the start symbol.

    Rule                        Application
    PN -> Richard M. Nixon      John likes Richard M. Nixon
    NP -> PN                    John likes PN
    V -> likes                  John likes NP
    VP -> V NP                  John V NP
    PN -> John                  John VP
    NP -> PN                    PN VP
    S -> NP VP                  NP VP
    START                       S

Notice this is just exactly the reverse of the first top-down depth-first derivation.

Trees  

A derivation can be associated with a tree in a fairly obvious way.

All three of the derivations we just looked at produced the same tree:
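
In the bracketed notation used in the parsing trace below, that tree is:

    (S (NP (PN John))
       (VP (V likes)
           (NP (PN Richard M. Nixon))))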

It turns out we don't usually care what order the steps in a derivation were found. What we're usually after is just the groupings of phrases they describe, because this is what is connected to meaning. Thus what we want is the tree not the derivation.

This is true in parsing computer languages as well.

Derivations are still very useful though because computing a tree will always mean finding a particular derivation (the steps have to be done in some order!). So they help us classify parsing algorithms.

Parsing  

We can now define what parsers and recognizers are.

Recognizers determine whether a particular string is licensed by a particular grammar. They give a yes/no answer. They do this by finding a derivation.

Parsers find the tree or trees for a string licensed by the grammar. They also do this by finding a derivation.

So the two kinds of algorithms are very closely connected.

Parser states: Recognition

We pair a list of categories we are trying to find with the span of input the search starts at:

    ((S) (John likes Richard M. Nixon))

A parser state is thus the current goal of the computation. It consists of two pieces of information:

  1. What we're looking for: A sequence of categories
  2. Where we're looking for it: the string we are trying to assign that analysis to (how far we've gotten in finding a derivation!)
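
The Python recognizer at the end of these notes packages these two pieces of information into ParserState objects. The class itself is not reproduced there, so the following is only a minimal sketch consistent with how it is used (built from a goal list and a word list, read back via .GoalList and .WordList):

    class ParserState:
        """Sketch (assumption): a parser state pairs goals with remaining input."""
        def __init__(self, GoalList, WordList):
            self.GoalList = GoalList    # 1. what we're looking for: a sequence of categories
            self.WordList = WordList    # 2. where we're looking for it: the remaining input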

For top-down parsing, we always start in the following state:

  1. S [or whatever "start" symbol of grammar is]
  2. the entire input string
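
In the Python recognizer given at the end of these notes, this initial state is exactly the first recursive call that recognize_string makes: the goal list holds just the start category, the agenda is empty, and the word list is the whole input.

    self.recurse_recognize([self.start_cat], [], self.input)   # goals = [S], empty agenda, whole input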

Recall the simple grammar given above.

Recognition

Here are the steps of the recognition computation for John likes R.M.N.:

    Step 1
        Current state:  ((S) (John likes R.M.N.))
        State stack:    (empty)
        Rewrite/Found:  initial state

    Step 2
        Current state:  ((NP VP) (John likes R.M.N.))
        State stack:    ((VP) (John likes R.M.N.))
        Rewrite/Found:  S -> NP VP (current state); S -> VP (state stack)

    Step 3
        Current state:  ((Art N VP) (John likes R.M.N.))
        State stack:    ((Art Adj N VP) (John likes R.M.N.))
                        ((PN VP) (John likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  NP -> Art N (current state); NP -> Art Adj N (state stack); NP -> PN (state stack)

    Step 4
        Current state:  ((Art Adj N VP) (John likes R.M.N.))
        State stack:    ((PN VP) (John likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Popped state stack!! "John" is not an Art

    Step 5
        Current state:  ((PN VP) (John likes R.M.N.))
        State stack:    ((VP) (John likes R.M.N.))
        Rewrite/Found:  Popped state stack!! "John" is not an Art

    Step 6
        Current state:  ((VP) (likes R.M.N.))
        State stack:    ((VP) (John likes R.M.N.))
        Rewrite/Found:  Lexical pop! "John" is a PN

    Step 7
        Current state:  ((V NP) (likes R.M.N.))
        State stack:    ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  VP -> V NP; VP -> V PP

    Step 8
        Current state:  ((NP) (R.M.N.))
        State stack:    ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Lexical pop! "likes" is a V

    Step 9
        Current state:  ((Art N) (R.M.N.))
        State stack:    ((Art Adj N) (R.M.N.))
                        ((PN) (R.M.N.))
                        ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  NP -> Art N (current state); NP -> Art Adj N (backup state); NP -> PN (backup state)

    Step 10
        Current state:  ((Art Adj N) (R.M.N.))
        State stack:    ((PN) (R.M.N.))
                        ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Popped state stack!! "Richard" is not an Art

    Step 11
        Current state:  ((PN) (R.M.N.))
        State stack:    ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Popped state stack!! "Richard" is not an Art

    Step 12
        Current state:  (() ())
        State stack:    ((V PP) (likes R.M.N.))
                        ((VP) (John likes R.M.N.))
        Rewrite/Found:  Lexical pop! "R.M.N." is a PN. Done!!
Parser states: Parsing

A parser state is now a triple: node list, input, tree:

    ((S) (John likes Richard M. Nixon) •)

We start in the following state:

  1. S [or whatever "start" symbol of grammar is]
  2. the entire input string
  3. • [the part of the tree we are currently building, start building top node]
Parsing

Here are the steps of the parsing computation for John likes R.M.N.:

    Step 1
        Current state:  ((S) (John likes R.M.N.) •)
        State stack:    (empty)

    Step 2
        Current state:  ((NP VP) (John likes R.M.N.) (S •))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 3
        Current state:  ((Art N VP) (John likes R.M.N.) (S (NP •)))
        State stack:    ((Art Adj N VP) (John likes R.M.N.) (S (NP •)))
                        ((PN VP) (John likes R.M.N.) (S (NP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 4
        Current state:  ((Art Adj N VP) (John likes R.M.N.) (S (NP •)))
        State stack:    ((PN VP) (John likes R.M.N.) (S (NP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 5
        Current state:  ((PN VP) (John likes R.M.N.) (S (NP •)))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 6
        Current state:  ((VP) (likes R.M.N.) (S (NP (PN John)) •))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 7
        Current state:  ((V NP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
        State stack:    ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 8
        Current state:  ((NP) (R.M.N.) (S (NP (PN John)) (VP (V likes) •)))
        State stack:    ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 9
        Current state:  ((Art N) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
        State stack:    ((Art Adj N) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
                        ((PN) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
                        ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 10
        Current state:  ((Art Adj N) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
        State stack:    ((PN) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
                        ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 11
        Current state:  ((PN) (R.M.N.) (S (NP (PN John)) (VP (V likes) (NP •))))
        State stack:    ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))

    Step 12
        Current state:  (() () (S (NP (PN John)) (VP (V likes) (NP (PN R.M.N.)))))
        State stack:    ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
                        ((VP) (John likes R.M.N.) (S •))
        A complete parse has been found; the search then continues with the states left on the stack.

    Step 13
        Current state:  ((V PP) (likes R.M.N.) (S (NP (PN John)) (VP •)))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 14
        Current state:  ((PP) (R.M.N.) (S (NP (PN John)) (VP (V likes) •)))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 15
        Current state:  ((P NP) (R.M.N.) (S (NP (PN John)) (VP (V likes) (PP •))))
        State stack:    ((VP) (John likes R.M.N.) (S •))

    Step 16
        Current state:  ((VP) (John likes R.M.N.) (S •))
        State stack:    (empty)

    Step 17
        Current state:  ((V NP) (John likes R.M.N.) (S (VP •)))
        State stack:    ((V PP) (John likes R.M.N.) (S (VP •)))

    Step 18
        Current state:  ((V PP) (John likes R.M.N.) (S (VP •)))
        State stack:    (empty)

    Step 19
        "John" is not a V and the state stack is empty, so the search halts. The only complete parse is the tree built at step 12.

Pseudocode Version

      Pseudocode for a top-down parser.
    Note: This version has lots of errors. Please check the Text errata for corrections.

    Pseudocode (my version, disagrees even with the corrected version):

      function TOP-DOWN-PARSE (Input, Grammar) returns a parse tree
        agenda <- (Initial S tree, Beginning of input)
        current-search-state <- POP(agenda)
        loop
          if SUCCESSFUL-PARSE?(current-search-state) then
             return TREE(current-search-state)
          else
             if CAT-OF-NODE(current-search-state) is a POS then
                      if  CAT-OF-NODE(current-search-state) 
                          in POS(CURRENT-INPUT(current-search-state)) 
                      then
                       PUSH(LEXICAL-POP(current-search-state), agenda)
                      else
                         current-search-state <- POP(agenda)
             else
                 expansions <- EXPANSIONS(CAT-OF-NODE(current-search-state))
                 current-expansion <- POP(expansions)
                 current-search-state <- 
                     EXPAND-CURRENT-NODE(current-search-state, 
                                        current-expansion)
                 agenda <- APPEND(expansions, agenda)
        end
      
Python Version

    Assume a Parser class. Full-blown code is given here.

    In the following code many of the function definitions begin with long multi-line strings. These are Python docstrings (self-documentation strings). You can look at the documentation of ANY Python method or function by looking at its __doc__ attribute. For example, assume the definition of recognize_string below, and assume it is a method defined for some class Parser, and that "p1" is an instance of that class. Then I could look at the documentation of recognize_string as follows:

    >>> print p1.recognize_string.__doc__
    
            {string}: A string.  The string to be parsed by current goals.
                    Immediately converted to a [WordList] by self.legal_input.
                    
            : The result of calling recurse_recognize on the start-cat,
            the empty agenda, and the wordlist generated from string.
    

    Now let's look at the code of a simple top-down recognizer (note: the code below differs in minor details from the code in the full-blown implementation, mostly in calls to trace functions).

        def recognize_string(self, string=''):
            """
            {string}: A string.  The string to be parsed.
                      Immediately converted to a [WordList] by self.legal_input.
                    
            [Return]: The result of calling recurse_recognize on the start-cat,
            the empty agenda, and the wordlist generated from string.
            """
            if self.legal_input(string):
                return self.recurse_recognize([self.start_cat],[],self.input)
            else:
                return False
    
        def recurse_recognize(self,goals,agenda,wordlist):
            """
            {goals}: [GoalList]: A list of categories and/or words proposed to cover
                                 wordlist
            {agenda}: A list of [ParserState]s: each a pair of ([GoalList],[WordList])
            {wordlist}: a list of words.
            [Return]: True iff {goals} and {input} are empty
    
            Suppose {goals} is non-empty:
            
            If {goals}[0] is a nonterminal, call expand (expand it with the grammar
            and continue parsing recursively).  If {goals}[0] is a terminal
            then call_match_word (try to match the next word and continue parsing).
            Both expand and match_word contain recursive calls to recurse_recognize.
    
            Suppose {goals} is empty:
    
            Then if {input} is empty, [Return] True. Else backtrack (If {agenda} is non-empty
            continue parsing with the next state on the agenda; else [Return] False)
            backtrack also contains a recursive call to recurse_recognize.
            """
            if len(goals) > 0:
                next_goal = goals[0]
                if not next_goal in self.terminals:
                    return self.expand(next_goal,goals[1:],agenda,wordlist)
                else:
                    return self.match_word(next_goal,wordlist,goals[1:],agenda)
            elif len(wordlist) == 0:
                return True
            else:
                return self.backtrack(agenda)
    
        def expand(self,next_goal,rest_goals,agenda,wordlist):
            """
            Expand next_goal using grammar (self.productions)
            Generate new GoalList using one production and add unused productions
            to the agenda. Keep parsing with new GoalList, new agenda, old WordList.
            """
            productions = self.productions[next_goal]
            next_production = productions[0]
            rest_productions = productions[1:]
            for p in rest_productions:
                agenda = [ParserState(p+rest_goals,wordlist)] + agenda
            return self.recurse_recognize(next_production+rest_goals, agenda, wordlist)
    
        def match_word(self,next_goal,wordlist,rest_goals,agenda):
            """
            {next_goal} is a terminal.
    
            Suppose {wordlist} is non-empty:
            
            If next_goal matches {wordlist}[0],  keep parsing with {rest_goals}, {agenda}
            and {wordlist}[1:]; else this parse path fails;  call backtrack (if {agenda}
            is non-empty, continue parsing with the next parser state
            on {agenda}, and if {agenda} is empty, [Return]: False)
    
            Suppose wordlist is empty
    
            This parse path fails.  Call backtrack
            """
            if len(wordlist) > 0:
                if next_goal == wordlist[0]:
                    ## We just matched a word! Discard and keep parsing rest of wordlist
                    return self.recurse_recognize(rest_goals,agenda,wordlist[1:])
                else:
                    return self.backtrack(agenda)
            else:
                return self.backtrack(agenda)
            
    
        def backtrack(self, agenda):
            """
            Suppose {agenda} is non-empty:
    
            Then
              agenda[0].GoalList = new goals
              agenda[0].WordList = new wordlist
            Keep parsing with new goals and new worldlist and popped agenda.
    
            Suppose {agenda} is empty
    
            return False
            """
            if len(agenda) > 0:
                return self.recurse_recognize(agenda[0].GoalList,agenda[1:],agenda[0].WordList)
            else:
                return False
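
    Finally, a usage sketch. The constructor call below is an assumption (the full-blown implementation may set up the grammar and start category differently); only the behavior of recognize_string follows from the code above:

        # Hypothetical setup -- the constructor signature is assumed, not taken
        # from the full-blown implementation.
        p1 = Parser(productions, start_cat='S')
        print(p1.recognize_string('John likes Richard M. Nixon'))   # expected: True
        print(p1.recognize_string('John likes the red'))            # expected: False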