[ Up to Computer Parsing with a CFG]
Parsing means dividing and subdividing the parts of the sentence into their components, and describing the relationships between these parts.
Presumably the better to analyze the meaning.
Typically, this results in a 'tree structure', with the root on top.
Along with stacks and vectors, 'trees' are a frequently encountered class of data structures used in computer science.
[ Up to Example: a simple CFG which accounts for a 'corpus' of three sentences.]
[ Up to Example: a simple CFG which accounts for a 'corpus' of three sentences.]
S -> NP VP
NP -> N
NP -> Det NP
NP -> Adj NP
VP -> IV ['intransitive verb']
VP -> VP PP
VP -> BV Adj [BV = 'Be verb']
VP -> MV S [MV = 'mental verb']
VP -> TV NP [TV = 'transitive verb']
PP -> Prep NP
N -> John
N -> Mary
N -> store
N -> beans
N -> they
N -> he
IV -> went
MV -> thought
TV -> bought
BV -> were
Prep -> to
Det -> the
Det -> some
Adj -> green
Adj -> expensive
[ Up to Computer Parsing with a CFG]
Every CFG stipulates that 'S' is the 'top', just as every finite-state machine must have a 'start state'.
We can start with the 'S', expand each of the rules that have 'S' on the left hand side, (say S->NP VP), and work our way through the NP->... rules, and VP->... rules until we get to rules involving 'terminals', i.e. actual word forms.
We could accomodate this sentence with a very simple CFG:
S-> Excl
Excl -> wow
[ Up to A trivial example: 'Wow.']
First we apply a pre-parser to the text, and mark the points on either side of each token in the sentence with indexes: '[0]wow[1]'.
[ Up to We can parse 'Wow.' top-down with a machine thus:]
We find the rule that has 'S' on its LHS: 'S->Excl'. (We push a goal to assert that rule as our working hypothesis, with the starting index of 0 on our stack.
That rule indicates that we'll have a sentence if we can prove that the whole sentence starting at index 0 is an exclamation. We'll indicate that we're current looking for an Exclamation by putting a little dot next to it.
[ Up to We can parse 'Wow.' top-down with a machine thus:]
So now we look for a rule which matches whatever category is currently marked with a dot on the top of our stack, which is 'Excl -> wow'. We push that on our stack.
Since 'wow' is a terminal (i.e. not a syntactic category), and we've indicated that we're at index 0, we check index 0 for 'wow', and find a match!
[ Up to We can parse 'Wow.' top-down with a machine thus:]
Since we've found a match, we indicate that we found the match from indexes 0 to 1, and pop the rule off our stack, putting it in a table of hypotheses we've checked. Let's call it 'Table entry A'.
Having just popped the 'Excl -> wow.' goal, we've exposed our old 'S->.Excl goal, and since that that hypothesis we just popped was successful, we can record a link between the 'Excl' part of this rule, and Table entry A. Then we shift the dot past the 'Excl' symbol, and since we've reached the end of this rule, we can put in 'Table entry B', with index from 0 to 1, pop the goal and we've parsed the whole sentence.
[ Up to We can parse 'Wow.' top-down with a machine thus:]
Note that since we've drawn a link between table entries B and A, we can reconstruct the parse tree of the sentence.
[ Up to We can parse 'Wow.' top-down with a machine thus:]
Wasn't it convenient that the first rules we checked happened to have led us to the right parse?. If the sentence happened to be 'Zounds!' instead of 'Wow!'(and our grammar had a richer vocabulary), we'd have to mark the 'Excl->wow' goal as a failure for index 0 in Table entry A, and then checked, say Excl -> groovy, Excl -> whoops, and so on, marking each of them as failures in our table until we came to the rule that goes 'Excl -> zounds'.
Now say that the sentence was 'John got a bad deal.' We'd eventually try every rule of the form 'Excl ->{whatever}', and having done that, give up on the hypothesis in our stack that says 'S->Excl at index 0'. Again, this means keeping that hypothesis in our table and marking it as a failure.
First we run a preprocessor over the text and mark the positions between words with indexes:[0]Jesus[1]wept[2]. These indexes allow us to keep track of which token we're dealing with in the sentence.
Let's say that now our grammar has these rules:
S -> Excl
Excl -> wow
S -> NP VP
NP -> N
N -> Jesus
VP -> IV
IV -> wept
The illustration at right shows the state of the parse just after we've found that the token wept, which occurs in our sentence between indexes [1] and [2], matches the rule 'IV->wept'.
Note from entries A and B in the table (in red) that we've tested the rules that worked previously for 'Wow.', and recorded them as failures.
Now the rule at the bottom of the stack is 'S -> NP VP', and from entries C, and D, we've fit those rules to the first word of input [0]Jesus[1]. We've recorded table entry D in the members column, and shifted the dot forward by one to 'VP'.
The rest of the stack is testing 'VP -> IV', and 'IV -> wept' in turn.
Entry E shows that we've just written the rule at the top of the stack to the table as a success, and we're now ready to pop.
[ Up to Computer Parsing with a CFG]
Conversely, we can start with the rules that have 'terminals' on the right-hand side, (say 'N -> John'), and work our way up upward through rules like say, 'NP -> N', until we end up with an 'S'. This is called bottom-up parsing.
We can demonstrate a simple bottom-up parse on the sentence 'Wow.', by maintaining a table with one column for each word in the sentence, and one row for each word in the sentence (one word = one table entry).
We start by adding to the table entry the rule that matches the token [0]wow[1]: 'Excl -> wow'. Then we move the dot to the end of the rule, and label it A.
Now we look for rules for which the first category on the RHS is the category we just recognized: 'Excl'. We have such a match in 'S -> Excl'
Since we've recognized 'Excl', we can move the dot past that symbol, which happens to be the end of the 'S -> Excl' rule. We label it B, and we're done.
Since our sentence has two tokens in it, we use a table with two columns.
As the black arrows indicate, the table entry on top will be put in charge of recording matches between all rules that occur within boxes along its left diagonal, and in the column directly down from it.
Just as in the 'Wow.' example, we record all the matches that proceed from 'Jesus' in the bottom table entry of the [0]Jesus[1] column. Note that in this column we can only move the dot in 'S -> NP VP' past 'NP' to 'VP', so it's not complete yet.
We do the same thing in the bottom entry of the [1]wept[2] column.
Now the top table entry which is responsible for all matches between indexes [0] to [2] finds and records a VP match for 'S -> NP VP', copies the rule over (as indicated in green), and assigns D as its second member.
[ Up to Computer Parsing with a CFG]
What should be clear after looking at these simple examples is the fact that this is an automated search of the space defined by the rules of our grammar.
As the number of rules grows, an the sentences get longer, the complexity of the search space grows.
The machine cannot know in advance which rules will work, so it must try them one at a time. The order in which rules are tried in the course of parsing makes a big difference to the efficiency of the process. So, the most frequently used rules should be tried first.
In order to try to simplify the search space, some parsers will remove certain rules from consideration when it can be determined that they are unlikely to be used. This is called pruning.
There is evidence that humans may be doing something like pruning in their own processing of sentences. Take for example:
This is an example of a 'garden path' sentence, since it leads the listener to commit to thinking that one type of pattern is being used, when in fact is is another pattern that is being used.