Perhaps the natural language application most familiar to the average user of the Internet is that of text retrieval. This is what goes on whenever you use a search engine such as Alta Vista, Hotbot, etc. Here we explore the underlying space, data structures, and algorithms that pertain to text retrieval, particularly as they pertain to web search engines.
The basic process is simple:
The site http://northernwebs.com/set/ has an interesting tutorial directed at web designers surveying some of the major search engines, and the various policies they use in indexing texts and making relevance rankings of documents.
In order to retrieve text documents, each document must be entered into a database. Documents on the World Wide Web are typically written in a markup language called HTML. A markup language contains a number of tags whose function is to tell computer programs important details about how to process the text. When your browser views a Web page, it reads the markup and displays its contents appropriately.
If you're using the browser to view an HTML page, the menu command to 'view source' will show you the HTML underlying the document.
Some tags in HTML are provided especially for search engines. There is a 'keyword' tag which specifically enlists keywords, and there is a 'description' tag which is intended to be displayed by search engines when posting it as a response to query.
HTML stands for HyperText Markup Language. Hypertext is text which can be linked to other text. Each Web page is identified by a kind of address known as a uniform resource locator, or URL. HTML indicates a link between pages by using a tag which associates some span of text with a particular URLs.
Search engines typically work by employing a program known as a spider to automatically follow links between pages, indexing each document as it finds it. Some 'deep' spiders will follow links as far down from each site's home page as it can, while some are 'shallow' spiders, which only depart from the home page a few levels.
When html documents are entered into a database, the contents of the markup language has to be filtered out, and what remains used as the basis for indexing the document.
To index a document means to establish 'pointers' between a set of terms and target documents in the database. Typically, indexing a text document involves these steps:
Apply a stemmer to the word, so that all occurances of the same lemma map to the same term. (e.g. 'walking', 'walked', and 'walk' all get tranlated to 'walk') Here's where morphological analysis comes into play.
Match each word token in the text against a stop list of words to be ignored. (What kind of words would they be?).
Use instances of the remaining word forms as indices.
Possibly, treat the contents of meta-tags, titles, and other tags differently.
A text database typically maintains an inverse file, which maps word forms to the documents that contain them. So the entry for 'rainbow' in an inverse file contains a listing of every document in the database which contains the word 'rainbow', along with the number of times that word occurs in the document, and any other data will be used in evaluating the document's association with the term.
The underlying space here is a large vector where every content-bearing lemma in the language defines one dimension. Each document in the database can be given a location in the space described by this vector. Each query can also be mapped onto this space.
One thing which has to be taken into account is the frequency with which different word forms occur in text. Some content words are very common, and other content words are extremely infrequent. An underlying assumption usually applied in text retrieval is that the less frequent the word, the more important it is to the meaning of the document. Also of course the number of times the word is used in the document should also reflect that word's importance to the meaning of the document.
Statistically, we can calculate for any document of a given length the expected number of occurrences for any given word form. We can compare the number of times that word actually does occur to the expected frequency, and derive a value which reflects the importance of that word to the meaning of the document.
Other factors can increase the importance of a given term to its containing document:
Since we're using each word form as a position in a vector, we can take this value for each content word in the document and map it onto the vector space. This means similar documents will be close to each other in this space. We can do the same thing with queries (essentially treating each query as a short document).
Using this model, retrieving a document means finding documents in the database which are closest in this vector space to the point mapped for the query.
Since the query is usually short, terms in the query can also be expanded to include synonyms from an on-line thesaurus (like Wordnet).
Other factors that may boost the relevance rating for given URL are
Performance on information retrieval is generally measured in terms of recall and precision.
Note that there is a tradeoff here. We could achieve 100% recall is we just retrieve everything. We could approach 100% precision if we only retrieve what we have a very high confidence for.