This section describes efforts to write computer systems which can read text and 'understand' its meaning. This can be abbreviated NLU, for Natural Language Understanding. We will emphasize two examples of work done by participants in the 'MUC' (Message Understanding Conference) series of research programs.
One such system was TACITUS (Hobbs et al.), which takes the approach of being very selective about which sentences it chooses to parse, parses these sentences intensively bottom-up, then builds a structure of logical propositions, filling in gaps where necessary by referring to a knowledge base of world knowledge.
Another approach is exemplified by the GE NLToolset (Jacobs & Rao). They use shallow parsing with a pre-processor in conjunction with a top-down parse, then use a flexible pattern matching scheme to match the input to roles in script-like templates.
The MUC conferences brought together several research organizations, each of which applied its own approach to a common task. In each case the task involved taking as input messages from a common domain, and producing templates whose fields indicated various pertinent aspects of the event. The earlier conferences concentrated on identifying and generating templates for incidents of terrorism in Latin America, using input from English-language news reports. Later conferences dealt with merger and acquisition events from business newswires.
In the case of terrorist incidents, the kinds of templates generated involved the type of incident (say, 'bombing'), the time and place, the number and nature of casualties, the identity of the perpetrators, etc.
For each conference there would be a large (1300 messages) training set, which each participant would use as the basis for developing their system. Then there would be a final testing round involving a hundred messages. Each participant would then be scored, and the results compared to those of competing systems.
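Performance was measured in terms of recall and precision. Recall is the fraction of the correct answers that the system actually produced; precision is the fraction of the system's answers that were correct.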
Note that there is a tradeoff here. We could achieve 100% recall if we just retrieved everything. We could approach 100% precision if we only retrieved what we had very high confidence in.
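To make the two measures and the tradeoff concrete, here is a minimal Python sketch. It assumes a simplified scoring in which the system output and the answer key are both just sets of (field, value) pairs. The field names and values are invented for illustration, and the actual MUC scoring procedure is not reproduced here.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (field, value) pairs.

    precision = correct answers / answers the system gave
    recall    = correct answers / answers in the key
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical answer key for one message.
GOLD = {("incident", "bombing"), ("location", "Bogota"), ("perpetrator", "unknown")}

# A cautious system only reports what it is sure of.
CAUTIOUS = {("incident", "bombing")}

# A greedy system reports many guesses, right and wrong.
GREEDY = GOLD | {("location", "Lima"), ("incident", "kidnapping"), ("perpetrator", "army")}

print(precision_recall(CAUTIOUS, GOLD))  # high precision, low recall
print(precision_recall(GREEDY, GOLD))    # low precision, high recall
```

The cautious system gets everything it says right but misses two of the three answers; the greedy one finds all three answers but half of what it says is wrong.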
Good performance in these conferences tended to be in the 50-70% range for both precision and recall.
The point of the pre-processing phase is to use fast finite-state processes to do as much of the work as possible.
Dealing with unknown words is always a problem with free text.
TACITUS applied a set of tests in sequence:
First, they would try to determine whether a spelling error was made.
Then they would apply a trigram model (of letters) of Spanish surnames to the word.
If that didn't work, they'd try to find morphological variants of known words.
Finally, they'd just assume it's a noun.
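The sequence of tests above can be sketched as a chain that stops at the first test that succeeds. This is only an illustration, not the TACITUS code: the tiny lexicon, the surname list, the thresholds, the suffix list and all the function names are assumptions, and Python's `difflib` stands in for whatever spelling corrector the real system used.

```python
import difflib

KNOWN = {"attack", "bomb", "guerrilla", "police"}            # toy lexicon (assumed)
SURNAMES = ["garcia", "martinez", "rodriguez", "hernandez"]  # toy training names (assumed)


def trigrams(word):
    """Letter trigrams of a word, with ^ and $ marking its start and end."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


SURNAME_TRIGRAMS = set().union(*(trigrams(name) for name in SURNAMES))


def surname_score(word):
    """Share of the word's letter trigrams that were seen in the surname list."""
    grams = trigrams(word)
    return len(grams & SURNAME_TRIGRAMS) / len(grams) if grams else 0.0


def classify_unknown(word):
    w = word.lower()
    # 1. Spelling error: is there a known word that is almost identical?
    close = difflib.get_close_matches(w, sorted(KNOWN), n=1, cutoff=0.85)
    if close:
        return ("misspelling", close[0])
    # 2. Does it look like a Spanish surname, letter-trigram-wise?
    if surname_score(w) >= 0.5:
        return ("surname", word)
    # 3. Morphological variant of a known word (crude suffix stripping).
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and w[:-len(suffix)] in KNOWN:
            return ("variant", w[:-len(suffix)])
    # 4. Give up and assume it is a noun.
    return ("noun", word)


for w in ["atack", "Fernandez", "bombed", "zorblat"]:
    print(w, classify_unknown(w))
```

The order of the tests matters: a word is only handed to a later, cruder test when every earlier one has failed.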
Typical sentences in actual newspaper articles are 25-30 words long, which makes the search space for parsing quite complex, and parsing every sentence in a message completely is widely regarded as impractical.
Many NLU systems use only 'shallow parsing' to deal with the problem of complexity of parsing. TACITUS approached the problem by being very selective about which sentences it parsed, but when it parsed, it tried to do so completely.
To find the most relevant sentences, they built a statistical relevance filter, which returns a relevance score for each sentence in a message.
They started by identifying relevant sentences in training data.
From the set of relevant sentences, they extracted 1-, 2-, and 3-grams which were good predictors of relevance (i.e. they had a much better chance of occurring in relevant sentences than in irrelevant ones).
Each new sentence could then be tested quickly for how many of these n-grams it contained, and receive a score.
Each sentence is then compared to the average score for sentences within the same message to find the most relevant ones.
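The steps above can be sketched in a few lines of Python. This is a toy reconstruction under stated assumptions, not the TACITUS filter: the selection rule (an n-gram must occur in at least `min_count` relevant training sentences, and at least `min_share` of the sentences containing it must be relevant), the score (a simple count of predictor n-grams), and the training sentences are all invented for illustration.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every 1-, 2- and 3-gram of a token list as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def train_predictors(labelled, min_count=2, min_share=0.9):
    """Keep n-grams that occur mostly in relevant sentences (assumed rule)."""
    in_relevant, in_any = Counter(), Counter()
    for sentence, is_relevant in labelled:
        grams = set(ngrams(sentence.lower().split()))
        in_any.update(grams)
        if is_relevant:
            in_relevant.update(grams)
    return {g for g, c in in_relevant.items()
            if c >= min_count and c / in_any[g] >= min_share}


def score(sentence, predictors):
    """Count how many predictor n-grams the sentence contains."""
    return sum(1 for g in ngrams(sentence.lower().split()) if g in predictors)


def most_relevant(message, predictors):
    """Return the sentences scoring above the average for their message."""
    scores = [score(s, predictors) for s in message]
    mean = sum(scores) / len(scores)
    return [s for s, sc in zip(message, scores) if sc > mean]


TRAINING = [  # (sentence, hand-assigned relevance): invented examples
    ("a bomb exploded near the embassy", True),
    ("guerrillas detonated a bomb in the capital", True),
    ("the bomb exploded at dawn", True),
    ("the president gave a speech in the capital", False),
    ("rain fell in the capital at dawn", False),
]
PREDICTORS = train_predictors(TRAINING)

MESSAGE = [
    "the minister arrived on monday",
    "a bomb exploded outside his hotel",
    "he gave a speech later",
]
print(sorted(PREDICTORS))
print(most_relevant(MESSAGE, PREDICTORS))
```

Comparing each score to the message average, rather than to a fixed threshold, means that a message which is dense with relevant vocabulary still only sends its strongest sentences on to the parser.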
Standard methods (remember n-grams?) were used to tag parts of speech.
Text is segmented to as great a degree as possible using finite-state techniques.
Certain keywords will trigger the activation of templates.
The authors don't go into detail as to how text is segmented, but a well-known bracketing technique involves looking for cues which mark the beginnings or ends of phrases. Examples are punctuation, determiners like 'the' (which begin noun phrases), and prepositions (which begin prepositional phrases); they can give strong hints as to where brackets may be placed within a sentence, and reduce parsing complexity.
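As a rough illustration of cue-based bracketing, the sketch below closes the current bracket at punctuation and opens a new one at each determiner or preposition. It is not the NLToolset's method (which, as noted, is not described in detail); the cue lists and the function name are assumptions chosen for the example.

```python
DETERMINERS = {"the", "a", "an"}
PREPOSITIONS = {"of", "in", "on", "at", "by", "for", "with", "to"}
PUNCTUATION = {",", ".", ";", ":"}


def bracket(tokens):
    """Split a token list into chunks at simple phrase-boundary cues."""
    chunks, current = [], []

    def close():
        nonlocal current
        if current:
            chunks.append(current)
        current = []

    for tok in tokens:
        low = tok.lower()
        if low in PUNCTUATION:
            close()                      # punctuation ends the current chunk
            continue
        # A determiner right after a preposition stays inside the same
        # prepositional phrase ("in the mountains").
        just_opened_pp = len(current) == 1 and current[0].lower() in PREPOSITIONS
        if low in PREPOSITIONS or (low in DETERMINERS and not just_opened_pp):
            close()                      # cue word begins a new chunk
        current.append(tok)
    close()
    return chunks


print(bracket("The rebels attacked a convoy in the mountains .".split()))
```

Note that the verb ends up in the same chunk as the subject, because nothing in these cue lists marks where a verb group starts. The cues only give hints, and the parser still has work to do inside and across the brackets.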
Certain kinds of domain-specific knowledge can also be extracted by the bracketing. For example, company names may be very important, and are relatively easy to extract: they end in words like 'Corp.', 'Inc.', etc.
Template activation is based on 'trigger' words. For example, 'rejected' proved to be a very good statistical predictor (on business newswires) of a corporate takeover.
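A finite-state pattern is enough for a first pass at company names. The regular expression below is an assumption made for illustration (a run of capitalized words followed by one of a few designators); it is not taken from the NLToolset, and it will over-match when an ordinary capitalized word, such as the first word of a sentence, comes directly before the name.

```python
import re

# One or more capitalized words, then a company designator (assumed list).
COMPANY = re.compile(r"\b(?:[A-Z][A-Za-z&-]*\s+)+(?:Corp|Inc|Co|Ltd)\b\.?")


def find_companies(text):
    """Return every substring that looks like a company name."""
    return [m.group(0) for m in COMPANY.finditer(text)]


TEXT = "Shares of Acme Corp. rose after Widget Holdings Inc. rejected the bid."
print(find_companies(TEXT))
```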
A corporate takeover template would largely be defined by a set of roles such as the suitor, the target, share prices, shareholders, etc. It would also have subevents like offers, acceptance, rejections, etc.
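One way to picture such a template is as a record with empty role slots that is created when a trigger word is seen. The sketch below assumes exactly that; the class, the trigger table and the role names as spelled here are illustrative choices, not the NLToolset's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    name: str
    roles: dict = field(default_factory=dict)      # role name -> filler (None until filled)
    subevents: list = field(default_factory=list)  # e.g. offers, acceptances, rejections


def new_takeover():
    roles = ("suitor", "target", "share_price", "shareholders")
    return Template("corporate-takeover", roles={r: None for r in roles})


TRIGGERS = {"rejected": "corporate-takeover"}      # trigger word -> template name (toy table)
FACTORIES = {"corporate-takeover": new_takeover}


def activate(tokens):
    """Create one empty template for each template name triggered by the tokens."""
    names = {TRIGGERS[t.lower()] for t in tokens if t.lower() in TRIGGERS}
    return [FACTORIES[name]() for name in sorted(names)]


ACTIVE = activate("Acme Corp. rejected the offer".split())
print(ACTIVE)
```

The later stages of the system then have a concrete job: decide which phrases in the text should fill each `None`.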
Certain words can be identified as 'pivots', which give clear clues as to likely fillers for some of these roles, and these can be identified by the preprocessor.
Template activation activates a dynamic lexicon of words which are specific to that template. These are added to a medium-sized core lexicon which holds only the most common senses of word forms. This keeps rare senses of words out of the way until they are needed.
As an example, 'engage' has a specialized military meaning which should only be referenced if it is clear that military activities are being discussed.
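A minimal sketch of the idea, assuming the lexicon is a mapping from word forms to lists of senses: the core lexicon is always consulted, and a template's own lexicon is only consulted once that template has been activated. The class, the template name and the sense glosses are invented for the example.

```python
class Lexicon:
    def __init__(self, core, template_lexicons):
        self.core = core                            # word -> common senses
        self.template_lexicons = template_lexicons  # template -> word -> extra senses
        self.active = []                            # templates activated so far

    def activate(self, template):
        if template not in self.active:
            self.active.append(template)

    def senses(self, word):
        """Core senses first, then senses contributed by active templates."""
        result = list(self.core.get(word, []))
        for template in self.active:
            result.extend(self.template_lexicons[template].get(word, []))
        return result


CORE = {"engage": ["hire", "occupy attention"]}
TEMPLATE_LEXICONS = {"military-action": {"engage": ["enter into combat with"]}}

lexicon = Lexicon(CORE, TEMPLATE_LEXICONS)
BEFORE = lexicon.senses("engage")       # military sense is out of the way
lexicon.activate("military-action")
AFTER = lexicon.senses("engage")        # military sense is now available
print(BEFORE, AFTER)
```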
The output of the preprocessor is usually marked up in some way, and certain domain-specific knowledge structures in the program are usually activated at this point. Parsing of real text can be quite complex, and it is by no means certain that every parse will have succeeded completely, so text understanding systems have to be prepared to make the best guesses they can about the structure of each sentence, and provide the best hints possible to the interpretation modules of the program.
There are 160 phrase structure rules.
Each rule has a 'constructor' which expresses constraints (such as part of speech) and a 'translator' which expresses the semantics of the phrase as expressions of predicate logic (e.g. 'there exists an x such that x is a bird'...)
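To show how the two halves of such a rule fit together, here is a small sketch in which a rule is a pair of functions: the constructor checks constraints on the child constituents, and the translator builds a logical form. The representation (dictionaries for constituents, a string for the logic) is an assumption made for brevity and is not the format used by the actual grammar.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Rule:
    name: str
    constructor: Callable[[Sequence[dict]], bool]  # constraints on the children
    translator: Callable[[Sequence[dict]], str]    # semantics of the phrase


NP_RULE = Rule(
    name="NP -> Det N",
    # Constraint: the children must be a determiner followed by a noun.
    constructor=lambda kids: [k["pos"] for k in kids] == ["Det", "N"],
    # Semantics: "there exists an x such that x is a <noun>".
    translator=lambda kids: f"exists x. {kids[1]['word']}(x)",
)


def apply_rule(rule: Rule, kids: Sequence[dict]) -> Optional[str]:
    """Return the phrase's logical form, or None if the constraints fail."""
    return rule.translator(kids) if rule.constructor(kids) else None


A_BIRD = [{"pos": "Det", "word": "a"}, {"pos": "N", "word": "bird"}]
print(apply_rule(NP_RULE, A_BIRD))
```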
In TACITUS, parsing is done bottom-up. Bottom-up parses tend to be more complex, but allow you to come up with partial parses when the parse fails.
The parse is pursued 'best first' according to a set of heuristics (rules of thumb) compiled through lots of experience parsing.
Recall that bottom-up parsing is done with a parse table. Pruning is done by limiting the number of candidates in each cell of the parse table for each basic syntactic category.
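The pruning step can be pictured as follows. The sketch assumes that each candidate in a cell carries a syntactic category and a heuristic score where higher is better, and that the limit is the same for every category; these details, and the example phrases, are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def prune_cell(candidates, limit=2):
    """Keep only the `limit` best-scoring candidates per syntactic category.

    candidates: list of (category, score, analysis) tuples for one cell.
    Returns a dict mapping category -> list of (score, analysis), best first.
    """
    by_category = defaultdict(list)
    for category, score, analysis in candidates:
        by_category[category].append((score, analysis))
    return {
        category: heapq.nlargest(limit, items, key=lambda item: item[0])
        for category, items in by_category.items()
    }


CELL = [
    ("NP", 0.9, "the old man"),
    ("NP", 0.4, "the old"),
    ("NP", 0.7, "old man"),
    ("VP", 0.5, "man the boats"),
]
print(prune_cell(CELL))
```

Because the limit applies per category, a cell crowded with noun-phrase candidates cannot squeeze out its only verb-phrase candidate.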
In cases where the whole sentence fails to parse, the longest, best (as in 'best first', above) sequence of fragments is used instead.
Very long sentences (some are as long as 60 words) are broken up along punctuation and certain words (like 'which'), and parsed separately before merging.
In the GE NLToolset, parsing is done top-down, but the bracketing (done by the preprocessor) provides a number of medium-range local parses, so if the parse fails, it is still possible to get a partial parse.
The parser provides the relation-driven control module with a set of possible relations that apply within the text, and gets back a preference score for each.
Semantic interpretation is done largely by filling the roles for the events and sub-events which pertain to active schema templates. This is done on the basis of preference scores provided by the relation-driven control module.
Sense discrimination is also done at this point, again in light of preference scores.
Interpretation is not necessarily done after the parsing process. Often some semantic interpretation goes on in parallel with parsing.
The output of the parser is a set of propositions (e.g., 'there exists x, and x is a train, and there exists y, and y is dynamite, and x hit y, and x derailed').
The job of the semantic interpreter is to make inferences which fill in the gaps about what is implied, for example that dynamite explodes, that explosions can cause damage, and that derailment is a kind of damage.
TACITUS does this by referring to a knowledge base of world knowledge. Where there are islands of propositions (like a statement about derailment and another about an explosion), the semantic interpreter tries to string together assumptions from the knowledge base which draw connections between them. These assumptions are weighted by salience (strength of activation), and the explanation which makes the fewest, most salient assumptions is used to fill in each gap.
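One simple way to picture this gap-filling is as a cheapest-path search through the knowledge base. The sketch below is a stand-in for illustration only and is not the TACITUS inference procedure: it assumes the knowledge base is a list of links between concepts, each with a salience between 0 and 1, and it charges 1/salience for every assumption used, so that chains with fewer and more salient links come out cheaper. The links and their salience values are invented.

```python
import heapq

LINKS = [  # (concept, concept, salience): invented values
    ("dynamite", "explosion", 0.9),    # dynamite explodes
    ("explosion", "damage", 0.8),      # explosions can cause damage
    ("derailment", "damage", 0.9),     # derailment is a kind of damage
    ("dynamite", "mining", 0.3),
    ("mining", "railway", 0.2),
    ("railway", "derailment", 0.4),
]


def best_explanation(links, start, goal):
    """Cheapest chain of assumptions connecting two propositions (Dijkstra)."""
    graph = {}
    for a, b, salience in links:
        cost = 1.0 / salience          # less salient assumptions cost more
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, step in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + step, nxt, path + [nxt]))
    return None                        # the two islands cannot be connected


COST, PATH = best_explanation(LINKS, "dynamite", "derailment")
print(PATH, round(COST, 2))
```

Here the chain through 'explosion' and 'damage' wins over the chain through 'mining' and 'railway': both use three assumptions, but the first uses far more salient ones.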
The illustration indicates the three entities which define some relation that has been suggested by the parser. The HEAD indicates that the overall event is 'acquire'; the ROLE being considered is the [target] of the acquisition (the company being acquired); the FILLER in the example is 'ACME Corp.' So the relation-driven control module is being asked to consider: how much should we commit to assigning 'ACME Corp.' to the role of [target] in an acquisition event?
It does this by evaluating three component questions:
The module assigns a numeric value to each of these questions, and returns their weighted sum.
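The arithmetic is just a weighted sum, which can be sketched as below. The three component questions are not reproduced here, so the scores are simply labelled by position, and the weights, the score values and the competing filler 'Monday' are all invented for illustration.

```python
WEIGHTS = [0.5, 0.3, 0.2]   # one weight per component question (assumed values)


def preference(scores, weights=WEIGHTS):
    """Weighted sum of the numeric answers to the component questions."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one score per weight")
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical component scores for two candidate fillers of the
# [target] role of an 'acquire' event.
CANDIDATES = {
    "ACME Corp.": [0.8, 0.5, 1.0],
    "Monday": [0.1, 0.2, 0.0],
}

BEST = max(CANDIDATES, key=lambda filler: preference(CANDIDATES[filler]))
print(BEST, preference(CANDIDATES[BEST]))
```

The parser can then commit to the filler with the highest preference score, or keep several candidates around if their scores are close.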