Language generation deals with the problem of getting a computer to translate a semantic representation into statements in natural language. This might be part of a natural language interface to a database, or part of a system whose interface is through a telephone.
In general, the task can be described in terms of:
Language generation can be described along a continuum. At one end, 'canned' language and form letters have been around for a long time. They are easy to implement, but are very specific to their application. At the other end of the spectrum is a general-purpose conversational agent.
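The 'canned' end of the continuum can be sketched in a few lines. This is only an illustration: the template text, the field names, and the function canned_reply are made up for the example and do not come from any particular system.

```python
from string import Template

# Hypothetical form-letter template; the wording and slot names are illustrative.
FORM_LETTER = Template("Dear $name, your order of $item has shipped.")


def canned_reply(name: str, item: str) -> str:
    """Fill the fixed slots of the template; no linguistic planning happens."""
    return FORM_LETTER.substitute(name=name, item=item)


print(canned_reply("Ms. Lee", "two books"))
```

The approach is easy to implement, but every new application needs its own templates, which is the limitation noted above.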
While analysis of language is best characterized as a process of inference, language generation is best characterized as a process of planning.
Text planning involves:
Text planning involves application of the rules of discourse we discussed earlier. Recall from our discussion of discourse that it is made up of segments, which can be modeled with a stack. Associated with each segment is some goal that the speaker is trying to achieve.
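A minimal sketch of the stack model of discourse segments, assuming each segment carries a single goal expressed as a string. The class and method names (Segment, DiscourseStack, push, pop, current_goal, say) are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One discourse segment: the goal it serves and what has been said in it."""
    goal: str
    utterances: List[str] = field(default_factory=list)


class DiscourseStack:
    """Open segments, innermost last. Closing a segment resumes its parent."""

    def __init__(self) -> None:
        self._segments: List[Segment] = []

    def push(self, goal: str) -> None:
        # Opening a sub-segment, for example a clarification inside an explanation.
        self._segments.append(Segment(goal))

    def pop(self) -> Segment:
        # The goal of the innermost segment has been met or abandoned.
        return self._segments.pop()

    def current_goal(self) -> Optional[str]:
        return self._segments[-1].goal if self._segments else None

    def say(self, text: str) -> None:
        # Utterances attach to whichever segment is currently open.
        self._segments[-1].utterances.append(text)


stack = DiscourseStack()
stack.push("explain why the vet visit is needed")
stack.push("establish that the dog has fleas")
stack.say("My dog has fleas.")
finished = stack.pop()
print(finished.goal, "->", stack.current_goal())
```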
Recall that discourse can be viewed in its informational aspect, or in its intentional aspect.
The illustration shows a depiction of a belief space. It contains a single consistent body of propositions. Each proposition must specify whether it is believed by the speaker, the hearer, or is shared by both parties.
Note that I, the speaker, need a means of representing knowledge that the hearer has but which I do not. This is related to my ability to formulate questions.
A conversation can be thought of as an attempt by both parties to expand on the area of shared beliefs.
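One way to picture such a belief space in code is as a mapping from each proposition to who holds it. This is a sketch under assumptions: propositions are plain strings, the names Holder, BeliefSpace, and inform are hypothetical, and consistency checking is left out entirely.

```python
from enum import Enum
from typing import Dict, Set


class Holder(Enum):
    SPEAKER = "speaker"  # believed by the speaker only
    HEARER = "hearer"    # known to the hearer but not to the speaker
    SHARED = "shared"    # believed by both parties


class BeliefSpace:
    def __init__(self) -> None:
        self._holder: Dict[str, Holder] = {}

    def add(self, proposition: str, holder: Holder) -> None:
        self._holder[proposition] = holder

    def held_by(self, holder: Holder) -> Set[str]:
        return {p for p, h in self._holder.items() if h is holder}

    def inform(self, proposition: str) -> None:
        """The speaker tells the hearer something, which enlarges the shared area."""
        if self._holder.get(proposition) is Holder.SPEAKER:
            self._holder[proposition] = Holder.SHARED


space = BeliefSpace()
space.add("has_fleas(dog1)", Holder.SPEAKER)
# A hearer-only entry records that the hearer knows something the speaker
# does not, which is what a question would be formulated from.
space.add("location(vet_office)", Holder.HEARER)
space.inform("has_fleas(dog1)")
print(space.held_by(Holder.SHARED))
```

In this picture, informing moves a proposition from the speaker's side into the shared area, and a question would be an attempt to do the same from the hearer's side.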
Text planning has to be mindful of the structure of shared knowledge, so that new statements can 'ground themselves' in it and feed new pieces of unshared knowledge into the mix. Recall that this was an important part of TACITUS' approach to semantics.
The intentional aspect of text generation attends to the goals of the speaker and hearer. These can be goals to move the boundary of common knowledge to include more of the speaker's knowledge (by informing the hearer), or to move the boundary on the hearer's side (by asking him or her a question); it can also involve attempts to get the hearer to do something, etc.
The intentional aspect of discourse and dialog is crucial, because goals give us a way of evaluating the expected outcomes of our plans.
Plans can be constructed out of the following components:
An example of a plan might be called 'find out by asking':
Allen (1995) describes a basic plan for a conversational agent using a 'BDI' (for Beliefs, Desires, Intentions) model.
Desires are a set of (possibly conflicting) desirable states. Because we're talking about a computer, these have to be measurable somehow. Examples might be to keep the number of pending questions to a minimum, to reduce cases where the computer's beliefs are known to differ from the hearer's, etc.
Beliefs are the contents of the belief spaces discussed earlier. In the course of planning and conversing, new facts may be added to the belief space.
Eventually the agent must choose a plan which it expects to maximally satisfy its desires, at which point it must commit to that plan and act on it by performing speech acts.
This process is repeated continuously.
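The choose-commit-act cycle can be sketched as follows. This is not Allen's formulation; it is a simplified illustration in which the state is a pair of counters, each desire is a scoring function over that state, and every name (Plan, satisfaction, choose_plan, bdi_step, the two example plans) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]


@dataclass
class Plan:
    name: str
    speech_act: str                    # what the agent would say or do
    predict: Callable[[State], State]  # expected state after acting


def satisfaction(state: State, desires: List[Callable[[State], float]]) -> float:
    """Desires must be measurable: here each one scores a state numerically."""
    return sum(desire(state) for desire in desires)


def choose_plan(state: State, plans: List[Plan],
                desires: List[Callable[[State], float]]) -> Plan:
    """Pick the plan whose predicted outcome best satisfies the desires."""
    return max(plans, key=lambda plan: satisfaction(plan.predict(state), desires))


def bdi_step(state: State, plans: List[Plan],
             desires: List[Callable[[State], float]]) -> Tuple[str, State]:
    """One cycle: commit to a plan and perform its speech act.

    Assumption: the predicted state is taken as the new state. A real agent
    would update its beliefs from the hearer's actual response instead.
    """
    plan = choose_plan(state, plans, desires)
    return plan.speech_act, plan.predict(state)


# Two desires from the text: few pending questions, few known belief mismatches.
DESIRES = [
    lambda s: -s["pending_questions"],
    lambda s: -s["belief_mismatches"],
]
PLANS = [
    Plan("answer_question", "inform",
         lambda s: {**s, "pending_questions": s["pending_questions"] - 1}),
    Plan("ask_question", "request",
         lambda s: {**s, "pending_questions": s["pending_questions"] + 1,
                    "belief_mismatches": s["belief_mismatches"] - 1}),
]

state = {"pending_questions": 2, "belief_mismatches": 1}
act, state = bdi_step(state, PLANS, DESIRES)
print(act, state)
```

Calling bdi_step in a loop corresponds to repeating the process continuously. The two desires conflict for the 'ask' plan, which is why some numeric weighing is needed.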
The text planner is responsible for providing a sequence of 'sentence chunks' to the sentence planner. The task of the sentence planner is to produce a surface realization.
Proceeding from the logical forms to the surface realization favors a bottom-up, 'head-driven' strategy. This allows us to choose the main predicate of the sentence first to serve as the head, then set subgoals to plan for generating the complements.
Feature-based unification grammars, such as the categorial grammars we discussed earlier, serve such approaches well.
Consider the definition of 'the':
Unification grammars are also reversible. They can be used in parsing as well as generation.
Say our text planner has carefully considered the discourse situation, and decided that my next goal is to inform you of the following beliefs:
We start with this lexical item as the head:
From there we can set subgoals to express n1. We find the word 'dog', which describes X but is [- det], so we set a further subgoal to make 'dog' [+ det]. 'The' won't work because the hearer doesn't know my dog (it is not part of our shared knowledge), but 'My' would work, because the hearer knows ME.
To express n2, we find that one sense of the word 'fleas' references the condition of being infested with some set of fleas, so we can use that.
The resulting composition: 'My dog has fleas.'
In choosing 'has' and each of the other words, we have addressed the problem of lexical selection. Note that we could have chosen to use the verb 'afflict', or 'infest', which would have presented us with a different set of constraints to plan for.
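The determiner subgoal in the example above can be sketched as a check against shared knowledge. This is an illustration only: the entity identifiers, the function names choose_determiner and noun_phrase, and the fallback to 'a' are assumptions added for the sketch and are not part of the worked example.

```python
from typing import Set


def choose_determiner(referent: str, owner: str, shared: Set[str]) -> str:
    """Pick a determiner that grounds the noun phrase in shared knowledge."""
    if referent in shared:
        # The hearer already knows this entity, so a definite reference works.
        return "the"
    if owner == "speaker" and "speaker" in shared:
        # The hearer does not know the entity but does know its owner.
        return "my"
    # Assumed fallback: introduce the entity as new with an indefinite.
    return "a"


def noun_phrase(noun: str, referent: str, owner: str, shared: Set[str]) -> str:
    """Make a [- det] noun [+ det] by prefixing the chosen determiner."""
    return f"{choose_determiner(referent, owner, shared)} {noun}"


shared_knowledge = {"speaker", "hearer"}
print(noun_phrase("dog", "dog1", "speaker", shared_knowledge))
```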
Speech synthesis involves taking a sentence representation and generating a waveform which sounds like a human voice.
In many respects this can be done as a straightforward reversal of the voice recognition task we addressed earlier in the year. Phonemes are broken down into smaller segments, and each of the smaller segments can invoke a synthesized waveform.
Where speech synthesis presents a greater challenge is in the area of 'prosody', which is to say getting the intonation right and placing the accents in the right place.
A simple approach is to emphasize the content words and deemphasize the function words. This doesn't explain:
Some function words are more likely to be accented than others: 'no', 'not', 'don't', 'who'.
Other function words tend to be 'cliticized', or folded into the words next to them like the 'to' in 'wanna', but also 'the', 'a', etc.
Other features, such as position in the sentence or phrase, also seem to be important.
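The simple content/function heuristic, together with the first two exceptions listed above, can be sketched as a lookup. The word lists are small hand-picked samples for illustration (OTHER_FUNCTION in particular is an assumption), and position in the phrase is ignored.

```python
from typing import List, Tuple

# Function words the text lists as likely to carry an accent.
ACCENTED_FUNCTION = {"no", "not", "don't", "who"}
# Function words the text lists as tending to be cliticized.
CLITIC = {"to", "the", "a"}
# Assumed sample of other function words; a real system needs a fuller list.
OTHER_FUNCTION = {"of", "and", "in", "is", "has", "my"}


def assign_prosody(words: List[str]) -> List[Tuple[str, str]]:
    """Label each word as 'accented', 'cliticized', or 'deaccented'."""
    labelled = []
    for word in words:
        key = word.lower()
        if key in ACCENTED_FUNCTION:
            label = "accented"
        elif key in CLITIC:
            label = "cliticized"
        elif key in OTHER_FUNCTION:
            label = "deaccented"
        else:
            # Anything not listed is treated as a content word.
            label = "accented"
        labelled.append((word, label))
    return labelled


for word, label in assign_prosody(["who", "wants", "to", "walk", "the", "dog"]):
    print(word, label)
```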
A Classification and Regression Tree (CART) can be used to provide a cascade of tests, each test referring to a specific feature of the word, and each decision point chosen to minimize the error with respect to a training set.
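A minimal sketch of the idea, assuming boolean word features and a tiny made-up training set. It is a simplification rather than full CART: it scores candidate splits by raw misclassification count instead of an impurity measure such as Gini, and it does no pruning. The feature names and the training rows are invented for illustration and are not real prosody data.

```python
from collections import Counter
from typing import Dict, List, Tuple, Union

Row = Tuple[Dict[str, bool], str]
Tree = Union[str, tuple]  # a leaf label, or (feature, yes_subtree, no_subtree)


def majority(labels: List[str]) -> str:
    return Counter(labels).most_common(1)[0][0]


def errors(labels: List[str]) -> int:
    """How many examples a majority-vote leaf would get wrong."""
    return len(labels) - max(Counter(labels).values()) if labels else 0


def build_tree(rows: List[Row], features: List[str]) -> Tree:
    labels = [label for _, label in rows]
    if errors(labels) == 0 or not features:
        return majority(labels)

    def split_error(feature: str) -> int:
        yes = [label for x, label in rows if x[feature]]
        no = [label for x, label in rows if not x[feature]]
        return errors(yes) + errors(no)

    # Choose the test that minimizes training error at this decision point.
    best = min(features, key=split_error)
    yes_rows = [(x, label) for x, label in rows if x[best]]
    no_rows = [(x, label) for x, label in rows if not x[best]]
    if not yes_rows or not no_rows:
        return majority(labels)
    rest = [f for f in features if f != best]
    return (best, build_tree(yes_rows, rest), build_tree(no_rows, rest))


def classify(tree: Tree, x: Dict[str, bool]) -> str:
    """Run the cascade of tests until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, yes_branch, no_branch = tree
        tree = yes_branch if x[feature] else no_branch
    return tree


# Invented toy training set: one row per word token.
TRAINING = [
    ({"is_content": True, "phrase_final": False}, "accent"),
    ({"is_content": True, "phrase_final": True}, "accent"),
    ({"is_content": False, "phrase_final": False}, "no_accent"),
    ({"is_content": False, "phrase_final": True}, "accent"),
]

tree = build_tree(TRAINING, ["is_content", "phrase_final"])
print(tree)
print(classify(tree, {"is_content": False, "phrase_final": False}))
```

On this toy set the tree first tests whether the word is a content word and only then looks at position, which mirrors the cascade of tests described above.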