RE-PURPOSING AND RE-COMBINING RESOURCES FOR AUTOMATIC ANNOTATION
Zheng-sheng Zhang
San Diego State University
The present paper summarizes the pedagogical benefits of automatic and exhaustive annotation and explores how existing resources can best be exploited for such annotation. Because dedicated annotators are few and often flawed, and new ones are hard to develop from scratch, I propose to re-purpose and re-combine the elements of existing resources, including state-of-the-art natural language processing tools, for pedagogical use.
1. Why automatic annotation?
First of all, why the desire for easier annotation? The primary consideration has to be the need to encourage a greater amount of reading by students beyond the elementary level and outside the classroom.
Secondly, annotation by computer represents the best use of the technology, specifically its random-access feature. The ‘when in doubt, click’ capability afforded by computer annotation is a welcome relief from the cumbersome task of using a Chinese dictionary.
Thirdly, annotation presupposes the segmentation of character strings into clearly demarcated word-sized units. Segmentation not only aids the reader in locating word boundaries; it also conveys meaning better, since the unit of vocabulary in Chinese is the word rather than the character.
Fourthly, using the computer for annotation can render an otherwise tedious process easy and fast. This makes the instant preparation of reading materials readily achievable.
Last but not least, the ease and speed of automatic annotation make exhaustive annotation a distinct possibility.
2. Why exhaustive annotation?
Exhaustive annotation, made possible by
the computer, has important implications for many aspects of the learning
process.
Selective annotation, which is the norm
of traditional textbooks, presupposes clearly defined targets and close
articulation between levels. In contrast,
exhaustive annotation by definition eliminates the need for such pre-selection
and articulation.
As a consequence, the
use of reading materials becomes very flexible. The same material can be used by learners of different levels for
different purposes.
The flexibility also allows real
learner choice of pace and sequencing that is not possible with selective
annotation.
Although automatic annotation is
extremely useful both to the learner and the teacher, there are few available
annotators at the present time. We will
examine the only two such tools that I am aware of:
1. Chinese Annotation Tool (CAT henceforth) was written by Erik Peterson of Mandarin Tools fame (www.mandarintools.com). It is hosted, among other mirror sites, on the website of the Chinese program at San Diego State University (http://www-rohan.sdsu.edu/dept/chinese/public_html/annotate.html). Available to anyone for free, it has been an invaluable resource for material development at SDSU. Following is a screenshot of the tool:

The tool is very easy to use. Text taken from a text editor or the internet (in GB or Unicode format) is pasted into the top window; a click on the Annotate! button yields a new result screen. There are a number of choices concerning the output. The output can simply be segmented text or a pinyin version of the text, or it can be annotated with both English glosses and pinyin, which can be positioned at various locations with respect to the characters. Following is a screenshot of the result screen (assuming the third option, ‘Add Dictionary Entries at status line’):

As can be seen from the result screen, the text is segmented and annotated both below the original text, in the form of endnotes, and at the status line as the cursor moves over the underlined words. (In the screenshot above, the cursor happens to be in the middle of the second line, on 中国人, whose pronunciation and definition are given at the status line.) When an underlined word is clicked, the corresponding endnote entry is brought to the top of the screen for ease of use. The result screen can be saved as an HTML file, which can then be printed out. The printed version, however, loses the instant lookup feature as well as the link between the annotated words and their annotations.
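The layout of such a result screen can be approximated in a few lines. The following is only a sketch of the idea, not CAT's actual code: the function `to_html`, the mini-dictionary passed to it, and the use of the HTML `title` attribute as a stand-in for the status line are all assumptions made for illustration.

```python
from html import escape

def to_html(tokens, glosses):
    """Render segmented tokens as HTML: glossed words become links to endnotes,
    with the gloss also shown on hover; unglossed tokens are left plain."""
    body, notes = [], []
    for token in tokens:
        entry = glosses.get(token)
        if entry is None:
            # No dictionary entry: the token stays un-annotated (no underline).
            body.append(escape(token))
            continue
        pinyin, english = entry
        anchor = f"n{len(notes) + 1}"
        tip = escape(f"{pinyin}: {english}", quote=True)
        body.append(f'<a href="#{anchor}" title="{tip}">{escape(token)}</a>')
        notes.append(
            f'<li id="{anchor}">{escape(token)} {escape(pinyin)} {escape(english)}</li>'
        )
    return "<p>" + "".join(body) + "</p>\n<ol>\n" + "\n".join(notes) + "\n</ol>"

# Hypothetical two-word input with one dictionary entry.
page = to_html(["中国", "人口"], {"中国": ("Zhōngguó", "China")})
print(page)
```

In this sketch a word that occurs twice receives two endnotes; a fuller version would presumably reuse the first one.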
Two problems with the program are obvious. First of all, some character strings remain un-annotated, i.e., without underline, for example 单元. They are not annotated because the built-in dictionary lacks entries for them. A more serious problem is seen in the 中国人 example: the character 人 should be grouped with the following 口 and not with 中国. This mistake in segmentation causes more harm than the failure to annotate.
2. NJSTAR, the word processor developed by Hongbo Ni, is a favorite among Chinese students and teachers due to its many pedagogically useful features. Its forthcoming version (beta version 4.5) has two new features to help the reader. One is the instant lookup feature: when the instant lookup option is enabled, a window with glosses pops up as the cursor moves over a word or a character. The second feature is called word annotation: a highlighted passage can be annotated paragraph by paragraph, with the annotations appended below the original text. The annotated file can be saved as a text (GB, Big5, Unicode) file.

A cursory look at the result shows a number of differences from CAT. The original text is not cut up into word-sized chunks. There is no indication of what is annotated and what is not. In addition, quite a few words in the first paragraph are not given annotations, namely 是、的、安、对、美方、向、中方、了. Unlike in CAT, there are also no links between the annotations and the annotated words. In this paragraph, there is also one segmentation mistake: 王国 is treated as a common noun instead of part of a personal name.
III. Evaluation of the Two Annotators:
Hence the choice between CAT and NJSTAR is clear: CAT wins out.
Because both programs rely heavily on the simple ‘maximum match’ algorithm, they share a number of problems, mostly concerning segmentation:
美国会 → 美国+会 or 美+国会
才能 → 才+能 or 才能 (n.)
这个人手上有伤 → 这+个+人+手上+有+伤
 → *这+个人+手上+有+伤
 → *这个+人手+上+有+伤
法国人民愤怒火烧酒厂 → 法国+人民+愤怒+火烧+酒厂
 → *法国人+民愤+怒火+烧酒+厂
CAT/NJSTAR: 王国安 → *王国+安
CAT: 周恩来、毛泽东、肯尼迪; not 尼克松
NJSTAR: 周恩来, not 毛泽东; 尼克松, not 肯尼迪
CAT: 单元、员工、民愤
NJSTAR: 发热、发冷、发呆 but not 发傻、发愣
f. Missing gloss: CAT: 是 as ‘this’
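The garden-path effect of maximum matching can be reproduced in a few lines. The following is a minimal sketch of greedy forward maximum matching, not the actual code of CAT or NJSTAR; the function name `max_match`, the toy `LEXICON`, and the four-character window are assumptions made for illustration.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry that starts there, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy lexicon (an assumption, not the dictionary of either program).
LEXICON = {"法国", "法国人", "人民", "民愤", "愤怒", "怒火", "火烧", "烧酒", "酒厂",
           "这个", "个人", "人手", "手上"}

print("+".join(max_match("法国人民愤怒火烧酒厂", LEXICON)))
print("+".join(max_match("这个人手上有伤", LEXICON)))
```

With this toy lexicon the sketch yields the starred segmentations 法国人+民愤+怒火+烧酒+厂 and 这个+人手+上+有+伤 listed above, because each greedy choice of the longest entry strands the characters that follow it.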
Obviously, fixing such problems inside the programs themselves is beyond the average language teaching professional.
A more effective and efficient strategy than tweaking and radical redesign is ‘re-purposing’, that is, using tools that were originally intended for natural language processing. Judging from the participants in the First International Chinese Word Segmentation Bakeoff, which took place on 4/22-25, 2003, sponsored by SIGHAN (Special Interest Group on Chinese Language Processing) of ACL (Association for Computational Linguistics), many natural language processing tools are being developed by research and private institutions in mainland China, Taiwan, Hong Kong, Singapore, Japan, the United States and elsewhere. In this section, we will examine two freely downloadable auto-tagging programs, which can potentially be used to replace the weaker segmentation portion of CAT.
3. WinAT. The program can be freely downloaded from: http://rocling.iis.sinica.edu.tw/CKIP/ws/.
Help is available at: http://godel.iis.sinica.edu.tw/CKIP/ws/help.htm.
An introduction by the developers to the algorithms used can be found at
The program has a standard Windows-type interface and a clearly written help file. On the main screen, there are two windows. Input text is loaded into the top window from a Big5 text file or pasted in. Output is given in the bottom window. Output options include the presence or absence of part-of-speech tags and of line numbers and dividers. For the purpose of using the output to feed CAT, it is best to set the options to no tags and no line numbers or dividers. Output can be saved as a text file or copied from the output window and pasted into another program. If Chinese system software is not used, then a Chinese text display program such as NJWIN (set to Big5 coding, with the Language Bar disabled if Windows XP is used) has to be used to see the characters. Following is a screenshot:

4. ICTCLAS (Institute of Computing Technology Chinese Lexical Analysis System).
It was developed by 张华平 and 刘群 at the Chinese Academy of Sciences. It can be downloaded for free at http://www.nlp.org.cn/, and pages detailing its operation can be found at http://pipy_world.y365.com/Manual/ICTCLAS_API.htm and www.nlp.org.cn/project/project.php?proj_id=6. In addition to the PC version, there is also an online trial version at: http://mtgroup.ict.ac.cn/~zhp/ICTCLAS.htm. Unlike those of the PC version, the results of the online version are hard to use as input to CAT, so we will not discuss it any further.
Although the operation of ICTCLAS is explained in a separate file, the interface is not as user-friendly as that of WinAT. As in WinAT, there are two windows, containing input and output respectively. As with WinAT, if Chinese system software is not used, then a Chinese text display program such as NJWIN (set to GB coding, with the Language Bar disabled if Windows XP is used) has to be used to see the characters. The interface is not of the standard Windows type, though. The buttons and choices between the two windows, which are not clearly marked, provide output options, such as tags versus no tags, more detailed versus less detailed tags, output display formats and so on. There are also functions not found in WinAT. One button allows the unattended processing of text files. The small leftmost window right above the output window allows the choice of the number of different results in the case of ambiguous segmentation and tagging. The other two windows above the output window show the time taken to process the input and the accuracy attained. Following is a screenshot of ICTCLAS, with the options set to no tags and only one result displayed:

5. Comparison between WinAT and ICTCLAS:
Although less user-friendly, ICTCLAS is better than WinAT because:
a. ICTCLAS provides more output options, especially the capability of providing multiple versions when there are ambiguous segmentations.
b. It deals with ambiguous segmentation much better, for example 美国会. Not only does it make the right decision for the context, it also offers the option of displaying the alternative results.
c. It deals with unlisted items, such as names, much better, for example 王国安.
d. It does not suffer from the mis-segmentation due to the overlapping of longer and shorter entries, as in the case of 法国人民.
To illustrate, the same test sentences, listed below, are fed first to WinAT and then to ICTCLAS. The starred items are potential mis-segmentations.
这个人手上有伤。(*个人、人手)
美国会议员工作主要是非难。(*美国、会议、员工、作主、要是、是非)
克林顿对内塔尼亚胡说:(*对内、胡说;内塔尼亚胡:Netanyahu)
法国人民愤怒火烧酒厂。(*国人、民愤、怒火、烧酒)
王国安总经理代表东方进出口公司对美方表示欢迎。(*王国)


As can be seen, ICTCLAS segments the sentences exactly as a native speaker would. WinAT, on the other hand, is led down the garden path by the potentially ambiguous strings in most cases. The choice is therefore quite clear: ICTCLAS provides better segmentation results.
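A comparison of this kind can also be checked mechanically. The sketch below assumes that a segmenter's output is available as a string of space-separated tokens (the form later pasted into CAT) and reports which of the starred strings appear as tokens; the function name `find_traps` and the sample outputs are hypothetical and are not taken from either program.

```python
# Starred strings for three of the test sentences above.
TRAPS = {
    "这个人手上有伤。": ["个人", "人手"],
    "法国人民愤怒火烧酒厂。": ["国人", "民愤", "怒火", "烧酒"],
    "王国安总经理代表东方进出口公司对美方表示欢迎。": ["王国"],
}

def find_traps(segmented, traps):
    """Return the starred strings that occur as whole tokens in the output."""
    tokens = segmented.split()
    return [trap for trap in traps if trap in tokens]

# Hypothetical outputs for the first sentence: one correct, one misled.
good = "这 个 人 手上 有 伤 。"
bad = "这个 人手 上 有 伤 。"
print(find_traps(good, TRAPS["这个人手上有伤。"]))
print(find_traps(bad, TRAPS["这个人手上有伤。"]))
```

An empty list means the segmenter avoided every starred string for that sentence; it does not by itself prove the segmentation correct.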
Given the superior segmentation performance of ICTCLAS and the unique annotation function of CAT, the logical step to take is to use the segmentation result of ICTCLAS to feed CAT for annotation, in the following manner:
Input text → ICTCLAS → segmented text → CAT → annotated text
The output text from ICTCLAS is copied and pasted into the input window of CAT. Because CAT respects spaces and will not regroup characters separated by them, its own segmentation step applies vacuously in most cases, and only the annotation portion of the program really acts on the input.
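The division of labor in this pipeline can be sketched as follows. This is an illustration under stated assumptions rather than a description of CAT's internals: the function `annotate_segmented` and the three-entry `GLOSSES` dictionary are hypothetical, and the input string stands for text that a segmenter has already divided with spaces.

```python
# Hypothetical mini-dictionary: word -> (pinyin, English gloss).
GLOSSES = {
    "美国": ("Měiguó", "United States"),
    "国会": ("guóhuì", "Congress"),
    "议员": ("yìyuán", "legislator"),
}

def annotate_segmented(segmented, glosses):
    """Look up each space-separated token as is; never regroup across spaces."""
    annotated, missing = [], []
    for token in segmented.split():
        if token in glosses:
            pinyin, english = glosses[token]
            annotated.append((token, pinyin, english))
        else:
            # No entry: the token stays un-annotated.
            missing.append(token)
    return annotated, missing

annotated, missing = annotate_segmented("美 国会 议员", GLOSSES)
print(annotated)
print(missing)
```

Because the lookup is done token by token, the entry 美国 in the dictionary cannot override the segmenter's decision to separate 美 from 国会; the annotator only supplies glosses, and tokens without entries are simply reported as missing.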
To show the improvement achieved by combining ICTCLAS and CAT, we can process the same test sentences with the CAT-only method and with the ICTCLAS + CAT method. First, the CAT-only method:

As can be seen, CAT used by itself fares rather poorly with these test sentences.
In contrast, the ICTCLAS + CAT method yields very good results:

As can be seen, there is no mistake at all. The only remaining flaw is the three un-annotated two-syllable strings, which lack underlines because CAT has no corresponding dictionary entries.
Automatic and exhaustive annotation is extremely useful pedagogically. However, the two existing annotators are quite flawed, especially in segmentation. For the segmentation function, we can instead use the much better automatic tagging programs developed for natural language processing. By combining the segmentation feature of the taggers with the annotation function of an annotator like CAT, we achieve much better results with minimal development effort.
References:
Sproat, R. and Thomas Emerson (2003) The First International Chinese Word Segmentation Bakeoff.
Zhang, Hua-ping et al. (2003) Chinese Lexical Analysis Using Hierarchical Hidden Markov Model, Proceedings of the Second SIGHAN Workshop Affiliated with 41st ACL.