Note: this material draws heavily on a guest lecture delivered last year by Dr. Paul Hays.
Prominent authors in this line of inquiry:
The primary assertion here is illustrated at right.
Meaning is conveyed by text, which is built on understandings of how words are used.
Any attempt to cause meaning with language is a text. Shopping lists, bumper stickers...
Phrase structures, pragmatic rules, and phonological rules are de-emphasized.
This is harder to pin down than one might imagine.
In general we talk about word forms. A word form (in written English) is a particular configuration of letters, wherever it occurs. Each individual occurrence of a word form is called a token.
As an example, take 'water'. There are related word forms:
water
waters
watered
watering
watery
This set of word forms is called a lemma.
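As a minimal sketch of the token / word form / lemma distinction, the Python below counts the tokens of each word form in a short made-up sentence and then pools the counts under a hand-listed lemma. The sample sentence, the `WATER_LEMMA` set, and the function names are illustrative assumptions, not part of the lecture material.

```python
import re
from collections import Counter

# Hand-listed word forms of the lemma, copied from the list above.
WATER_LEMMA = {"water", "waters", "watered", "watering", "watery"}

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def form_counts(text):
    """Count how many tokens of each word form occur in the text."""
    return Counter(tokenize(text))

def lemma_count(text, forms):
    """Total number of tokens whose word form belongs to the lemma."""
    counts = form_counts(text)
    return sum(counts[form] for form in forms)

# Made-up sample text, for illustration only.
sample = "She watered the garden while the water ran; watering takes water."
print(form_counts(sample)["water"])      # tokens of the word form 'water'
print(lemma_count(sample, WATER_LEMMA))  # tokens of any form in the lemma
```

Listing the lemma by hand sidesteps the question of how word forms get grouped in the first place; an automatic lemmatizer would be a separate step.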
It is not meaningful to ask whether 'water' is, say, a 'noun' or a 'verb'; it can be either, depending on how it is used.
It is argued that the best way to find out how 'water' is used is to study many cases of actual use and to abandon one's prejudices about generalities such as syntactic categories.
Many of the ways that, say, 'water' can be used are subtly linked to very specific situations and to other words.
An example:
His mouth watered.
His eyes watered.
But this paradigm doesn't extend to 'watering':
The roast was mouth-watering.
*The [smokey] nightclub was eye-watering.
Special relationships between words which tend to be used together (or not) are called collocational constraints.
It has been demonstrated that children learn the forms of, say, 'water' as separate items. Only when we go to 'grammar school' are we trained to recognize nouns and verbs and to formalize rules about language.
Many units of meaning extend over several 'words', for example 'kick the bucket', whose meaning cannot be derived from its parts.
[Many of the] sentences we utter [can be thought of as] [having a substantial component of] [frequently used] phrases. Again, this can be thought of in terms of collocations, where these groups of words reflect frequently encountered contexts in the language community.
Computers can be used to study collocation in large text corpora. (Recall that a corpus is a large body of text, usually on the order of millions of words, compiled in an attempt to characterize the linguistic behavior of a particular language community.)
Some corpora available online:
Typically this means selecting a word form (type) that one wants to study, which will serve as the node. Each occurrence of the type is a token. We then select a span within which we want to study the co-occurrence of other words. Most of the significant relationships fall within a span of plus or minus 4 words.
Whenever a token of our node word occurs, we tally each of the tokens of other words that occur within its span. Where there are many occurrences of the node word, a statistical profile of that word's collocates starts to emerge.
This works a lot better for content words than for function words.
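The tallying procedure just described can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions (whitespace tokenization, a symmetric window, a made-up sentence); the function name `collocate_counts` is hypothetical.

```python
from collections import Counter

def collocate_counts(tokens, node, span=4):
    """Tally the word forms found within `span` tokens of each token of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = max(0, i - span)
        # Up to `span` tokens on each side, excluding the node token itself.
        window = tokens[left:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

# Made-up text, for illustration only.
tokens = "his mouth watered at the smell and his eyes watered in the smoke".split()
profile = collocate_counts(tokens, "watered", span=4)
print(profile.most_common(5))
```

Even in this tiny sample, function words such as 'the' and 'his' are tallied as often as any content word, which is why a raw count is not enough on its own.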
A simple count of the words that occur within the span would yield a lot of function words.
What you want to do is take into account the expected frequency of each word.
A simple measure of this is the total number of occurrences of the word in the corpus divided by the total number of words in the corpus.
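As a rough sketch, the relative frequency described above can be turned into an expected count of co-occurrences by multiplying it by the number of window positions around the node. That scaling step is an assumption added here for illustration, as are the function names and the figures.

```python
def relative_frequency(word_count, corpus_size):
    """Occurrences of the word divided by the total number of words in the corpus."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return word_count / corpus_size

def expected_cooccurrences(word_count, corpus_size, node_count, span=4):
    """Expected tokens of the word inside the node's windows, assuming chance alone.

    Assumption: every node token contributes 2 * span window positions.
    """
    window_slots = node_count * 2 * span
    return relative_frequency(word_count, corpus_size) * window_slots

# Made-up figures: a function word seen 60,000 times in a 1,000,000-word
# corpus, and a node word that occurs 50 times.
E = expected_cooccurrences(word_count=60_000, corpus_size=1_000_000,
                           node_count=50, span=4)
print(E)
```

A very frequent word has a high expected count, so it has to turn up near the node far more often than that before it looks like a genuine collocate.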
O = observed frequency.
E = expected frequency.
Measures the amount of information that the occurrence of one word gives about the other.
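One common way to express this kind of measure in corpus work is the mutual information score, log2(O / E). Assuming that definition, a minimal Python sketch looks like this; the function name and the figures are illustrative.

```python
import math

def mutual_information(observed, expected):
    """log2(O / E): positive when the pair co-occurs more often than chance predicts."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected must both be positive")
    return math.log2(observed / expected)

# Made-up figures: the collocate was seen 32 times near the node, where
# chance alone would predict 4.
print(mutual_information(observed=32, expected=4))  # log2(8) = 3.0
```

Because the score is a ratio, rare words with a handful of co-occurrences can score very high, so it is usually read alongside the raw counts.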
O = observed frequency.
E = expected frequency.
Measures how much the collocate is influenced by the appearance of the node.
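A measure often reported alongside mutual information in corpus work is the t-score, (O - E) / sqrt(O). Assuming that this is the measure intended here, it can be sketched as follows; the function name and the figures are illustrative.

```python
import math

def t_score(observed, expected):
    """(O - E) / sqrt(O): grows with the amount of evidence for the pairing."""
    if observed <= 0:
        raise ValueError("observed must be positive")
    return (observed - expected) / math.sqrt(observed)

# Made-up figures: 36 observed co-occurrences against 12 expected by chance.
print(t_score(observed=36, expected=12))  # (36 - 12) / 6 = 4.0
```

Unlike a pure ratio, this score stays low when the observed count is small, so it tends to favor frequent, well-attested collocates.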
last week. Then he added, with a ??????: 'And, of course, we all know there
good measure. The audience did not ?????? (as they were to do when Peter O'
him, nodded at the words. A blurting ?????? came from the youth by the dresser
a day; but he was equally quick to ??????, loosing a sharp, distinctive bray
to Israel has produced a nationwide ?????? of contempt. [p] For the
gets underway without Rob Lowe ?????? selected highlight for Edinburgh
back home have elicited one major ??????. The reporter from Ireland's Sunday
for an evening He bursts into a loud ?????? The first time I got a jab, I said
way of talking is tempered by a ?????? which he deploys, to great effect,
jokes which make people giggle and ?????? while his knowing eye roved the
[ Up to Computer analysis of collocation.]
too much telly Martin said with a ??????. [p] How would you like it, Dad, if
public image. He likes a drink and a ?????? [p] Derby, champions twice in the
and the next they were having a good ?????? [p] Dick denied suggestions there
and the others were game for a ??????. [p] By showing at least three
a short, black man, protested with a ??????. [p] The man called Cross turned
rice they've harvested, and they all ??????. [p] Unidentified Woman #2:
but identity, achievement, friends a ??????, a better day than one spent
really very very serious but we can ?????? about it. Things can be really f ed
as I remember. We always used to ?????? about the time when you were three
myself as an avenging angel. Then I ?????? aloud at my own audacity and admire
[ Up to Computer analysis of collocation.]
the right to show the property [f] ?????? [f] the original listing
on their knee, that they can talk ??????. [o] The cover is that we're `
Minister has been appointed to push ?????? a raft of unpopular reforms,
peach of a right hand that travelled ?????? a 20-year time warp. [p] Set
stone crab and fresh shrimps, sold ?????? all manner of outlets from
during China's cultural revolution ?????? Anchee Min's vividly portrayed
which would take time to come ?????? and would provoke a political
flight from here to Charleston, ?????? Chicago, I think. So I'll be home
a well-worn path from hopefulness, ?????? disillusion, to the centre-piece
[ Up to Computer analysis of collocation.]
that local officials were due to ?????? action against Mafia gangs. Our
proposes to run the Cabinet, ?????? activities of leaders of the
of their development plans; [p] ?????? activity relating to the SEM; [p]
trade and monetary problems and ?????? aid to the developing world. [p]
Arab states. The aim is to ?????? all their stands in preparation
whose headquarters is in Rome, to ?????? all future famine relief efforts
how the district's congressmen can ?????? all the local efforts to attract
headquarters in Langley, ready to ?????? any actions that might be
own inter-republic commission to ?????? efforts to revive the dying sea.
Macaraig, who is supposed to ?????? energy affairs, but he is coming
park ranger Eileen Martinez to ?????? film permits. Her salary is paid )
[ Up to Computer analysis of collocation.]
were discussed at every stage,' says ??????. [p] In all it took about 12 months
with a subs' spot. [p] The enigmatic ?????? Barnes is likely to retain his place
in the Run, since her late husband, ??????, bought a vanload of bits for
Times, says not necessarily so. [p] ?????? Brennan (Los Angeles Times Bill
by papal Master of Ceremonies, ?????? Burchard, in his detailed and
She'd been through a lot. [p] ?????? came home with a five-year nightmare
the doyen of British sci-fi writers. ?????? Clute, editor of The Encyclopaedia
punter in toilet He's great that ?????? Dasilva I heard he learnt it all off
ago has curbed violence. [p] Cochran: ?????? Devitt filed that report for the
her attention to anyone: too eagerly, ?????? felt, though in his annoyance he had
In this example, Dr. Schutze downloaded 50 million words from the NY Times. He filtered out the function words and chose several thousand words to serve as nodes, with a 'large' span.
Download Dimensions of Meaning by H. Schutze (It's near the bottom of the page).
You are encouraged, but not required, to read the article.
What is important for purposes of this course is illustrated in:
Figures 1 and 2, which demonstrate a simple 2-d version of the 'space' that can be derived from collocation.
Figure 3, which shows a much simplified version of the semantic 'space' that was derived by applying this technique with thousands of dimensions.
Table 3, which shows several words and their nearest 'neighbors' in that space.
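To make the idea of a collocation 'space' concrete, here is a toy 2-d sketch in Python. Each word is represented by how often it co-occurs with two dimension words, and neighbors are ranked by cosine similarity. The counts, the choice of dimensions, and the use of cosine similarity are assumptions made for illustration; this is not presented as the procedure used in the article.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two count vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, k=2):
    """Return the k words whose vectors point most nearly the same way."""
    target = vectors[word]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up co-occurrence counts with two hypothetical dimension words,
# in the order [count near 'money', count near 'water'].
vectors = {
    "bank":  [30, 2],
    "loan":  [25, 1],
    "river": [3, 28],
    "shore": [1, 20],
}
print(nearest_neighbors("bank", vectors, k=1))
```

With thousands of dimensions instead of two the arithmetic is the same; only the length of each vector changes.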
Here's an example of text from a tagged corpus.
Here is a complete tag set used to tag a corpus.
If a corpus is intensively analyzed by humans, n-gram data can be taken and applied to the task of determining probable tags or words in new text.
If you just take the most frequent part of speech for each word, you already have about 90% accuracy.
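The most-frequent-tag baseline can be sketched as below. The tiny training list and the tag names (`DET`, `NOUN`, `VERB`, and so on) are made up for illustration and are not taken from the tag set linked above; the default tag for unseen words is also an assumption.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_tokens):
    """Map each word form to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, model, default="NOUN"):
    """Tag each word with its most frequent tag; unseen words get `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Made-up hand-tagged data: 'water' is a noun twice and a verb once.
training = [
    ("the", "DET"), ("water", "NOUN"), ("is", "VERB"), ("cold", "ADJ"),
    ("they", "PRON"), ("water", "VERB"), ("the", "DET"), ("plants", "NOUN"),
    ("water", "NOUN"),
]
model = train_baseline(training)
print(tag_baseline(["They", "water", "the", "plants"], model))
```

Here 'water' is tagged as a noun even where it is used as a verb, which is the kind of error the baseline cannot avoid because it ignores context.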
Accuracy can be increased by taking the text a sentence at a time and using a Markov model to assess the probability of each possible assignment of tags.
The calculation for each word transition is a function of 1) the probability that the word is assigned to each category and 2) how strong the links are between nodes in the model.
Two (or more) competing sequences of tag assignments for a given sentence can be compared by taking the product of the values for each transition; the sequence with the largest overall score is taken as the best estimate.
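The comparison of competing tag sequences can be sketched as follows. Every probability below is made up for illustration, the tag names are hypothetical, and the sketch simply enumerates all candidate sequences; a practical tagger would normally use the Viterbi algorithm to find the same best sequence without enumerating them all.

```python
from itertools import product

def sequence_score(words, tags, word_tag, trans, start="<s>"):
    """Product over the sentence of (link strength) * (word-to-category probability)."""
    score = 1.0
    prev = start
    for word, tag in zip(words, tags):
        score *= trans.get((prev, tag), 0.0) * word_tag.get((word, tag), 0.0)
        prev = tag
    return score

def best_tagging(words, tagset, word_tag, trans):
    """Score every possible tag sequence and keep the highest-scoring one."""
    candidates = product(tagset, repeat=len(words))
    return max(candidates, key=lambda tags: sequence_score(words, tags, word_tag, trans))

# Made-up probability that each word is assigned to each category.
word_tag = {
    ("they", "PRON"): 1.0,
    ("water", "NOUN"): 0.7, ("water", "VERB"): 0.3,
    ("plants", "NOUN"): 0.8, ("plants", "VERB"): 0.2,
}
# Made-up strengths of the links between tags ("<s>" marks the sentence start).
trans = {
    ("<s>", "PRON"): 0.5, ("<s>", "NOUN"): 0.4, ("<s>", "VERB"): 0.1,
    ("PRON", "VERB"): 0.8, ("PRON", "NOUN"): 0.1, ("PRON", "PRON"): 0.1,
    ("VERB", "NOUN"): 0.7, ("VERB", "VERB"): 0.1, ("VERB", "PRON"): 0.2,
    ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.6, ("NOUN", "PRON"): 0.1,
}
words = ["they", "water", "plants"]
print(best_tagging(words, ["PRON", "NOUN", "VERB"], word_tag, trans))
```

With these figures the winning sequence tags 'water' as a verb, even though in isolation it is more likely to be a noun (0.7), because the link from a pronoun to a verb is much stronger than the link from a pronoun to a noun.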