RE-PURPOSING AND RE-COMBINING RESOURCES

FOR AUTOMATIC ANNOTATION

 

Zheng-sheng Zhang

San Diego State University

zzhang@mail.sdsu.edu

 

The present paper summarizes the pedagogical benefits of automatic and exhaustive annotation and explores how existing resources can best be exploited for such annotation.  Because dedicated annotators are few and often flawed, and new ones are hard to develop from scratch, I propose to re-purpose and re-combine elements of existing resources, including state-of-the-art natural language processing tools, for pedagogical use.

 

I. Pedagogical Motivation

 

1. Why automatic annotation?

 

First of all, why the desire for easier annotation?  The primary consideration has to be the need to encourage a greater amount of reading by students beyond the elementary level and outside the classroom.

Secondly, annotation by computer represents the best use of the technology, specifically its random-access feature.  The ‘when in doubt, click’ capability afforded by computer annotation is a welcome relief from the cumbersome task of using a Chinese dictionary.

            Thirdly, annotation presupposes the segmentation of character strings into clearly demarcated word size units.  Segmentation not only aids the reader in locating word boundaries, it also makes for better conveyance of meaning, since the unit of vocabulary in Chinese is the word rather than the character.

Fourthly, using the computer for annotation can render an otherwise tedious process easy and fast.  This makes the instant preparation of reading materials readily achievable. 

Last but not least, the ease and speed of automatic annotation make exhaustive annotation a distinct possibility.

 

2. Why exhaustive annotation?

 

Exhaustive annotation, made possible by the computer, has important implications for many aspects of the learning process. 

Selective annotation, which is the norm of traditional textbooks, presupposes clearly defined targets and close articulation between levels.  In contrast, exhaustive annotation by definition eliminates the need for such pre-selection and articulation. 

            As a consequence, the use of reading materials becomes very flexible.  The same material can be used by learners of different levels for different purposes.

            The flexibility also allows real learner choice of pace and sequencing that is not possible with selective annotation.

II. Two Automatic Annotation Tools

 

Although automatic annotation is extremely useful both to the learner and the teacher, there are few available annotators at the present time.  We will examine the only two such tools that I am aware of:

 

1.  Chinese Annotation Tool (CAT henceforth) was written by Erik Peterson of Mandarin Tools fame (www.mandarintools.com).  It is hosted, among other mirror sites, on the website for the Chinese program at San Diego State University (http://www-rohan.sdsu.edu/dept/chinese/public_html/annotate.html).  Available to anyone for free, it has been an invaluable resource for material development at SDSU.  Following is a screenshot of the tool:

 

 

The tool is very easy to use.  Text taken from a text editor or the internet (in GB or Unicode format) is pasted into the top window; a click on the Annotate! button yields a new result screen.  There are a number of choices concerning the output.  The output can simply be segmented text or a pinyin version of the text; or it can be annotated with both English glosses and pinyin, which can be positioned at various locations with respect to the characters.  Following is a screenshot of the result screen (assuming the third option of ‘Add Dictionary Entries at status line’):

 

 

As can be seen from the result screen, the text is segmented and annotated in two places: below the original text in the form of endnotes, and on the status line as the cursor moves over the underlined words (in the screenshot above, the cursor is on 中国人 in the middle of the second line, and its pronunciation and definition are given on the status line).  When an underlined word is clicked on, the corresponding endnote entry is brought to the top of the screen for ease of use. 

The result screen can be saved as an HTML file, which can then be printed out.  But the printed version loses the instant lookup feature as well as the links between the annotated words and their annotations. 

Two problems with the program are obvious.  First of all, some characters remain un-annotated, i.e., without underline, for example 单元.  They are not annotated because the built-in dictionary lacks the corresponding entries.  A more serious problem is seen in the 中国人 example.  As can be seen, the character should be grouped with and not 中国.  This mistake in segmentation causes more harm than the failure to annotate.

 

2. NJSTAR, the word processor developed by Hongbo Ni, is a favorite among Chinese students and teachers due to its many pedagogically useful features.  Its forthcoming version (beta version 4.5) has two more features to help the reader.  One is instant lookup: when this option is enabled, a window with glosses pops up as the cursor moves over a word or a character.  The second feature is called word annotation: a highlighted passage can be annotated paragraph by paragraph, with the annotations appended below the original text.  The annotated file can be saved as a text (GB, Big5, Unicode) file.

 

 

A cursory look at the result shows a number of differences from CAT.  The original text is not cut up into word-size chunks.  There is no indication of what is annotated and what is not.  In addition, quite a few words are not given annotations in the first paragraph, i.e., 是、的、安、对、美方、向、中方、了.  Unlike in CAT, there are also no links between the annotations and the annotated words.  In this paragraph, there is also one segmentation mistake, i.e., 王国 is treated as a common noun instead of part of a personal name.

 

III.  Evaluation of the Two Annotators:

 

  1. Differences between the two:

 

    a. Segmentation is explicit in CAT but implicit in NJSTAR.  Segmentation is also more exhaustive in CAT than in NJSTAR: CAT segments more words than it annotates, because it uses a larger wordlist for segmentation than the dictionary it uses for annotation.  NJSTAR, however, gives no independent indication of segmentation beyond what can be inferred from the annotations. 

 

    b. As annotated items are linked to annotations in CAT but not in NJSTAR, CAT takes greater advantage of computer technology, while NJSTAR looks more like the traditional textbook. 

 

    c. CAT is more exhaustive than NJSTAR in annotation as well: it annotates single-syllable words as well as multiple-syllable words, whereas NJSTAR systematically leaves single-syllable words un-annotated.  For example, none of the words in the sentence ‘我要喝一杯水’ will be annotated in NJSTAR. 

 

    d. NJSTAR’s text file output is easier to edit than CAT’s HTML file, which, in addition to the character text, also contains the above-mentioned dynamic links between the annotated words and the annotations.

 

    e. CAT is free while NJSTAR is not.  It is not clear whether the free demo version of NJSTAR will include the word annotation feature.

 

Hence the choice between CAT and NJSTAR is clear: CAT wins out.

 

  2. Problems common to CAT and NJSTAR

 

Because both programs rely heavily on the simple ‘maximum match’ algorithm, they share a number of problems, mostly concerning segmentation:

 

    a. Inability to deal with ambiguous segmentation:

美国会    美国+会    or    美+国会;

才能    才+能    or    才能 (n.)

 

这个人手上有伤 → 这+个+人+手上+有+伤。

*这+个人+手上+有+伤。

*这个+人手+上+有+伤。

 

    b. Wrong segmentation due to overlapping entries (法国人 and 法国):

法国人民愤怒火烧酒厂 → 法国+人民+愤怒+火烧+酒厂

*法国人+民愤+怒火+烧酒+厂

 

    c. Wrong segmentation/inconsistent annotation for unlistable items (names, etc.):

CAT/NJSTAR: 王国安 → *王国+安

CAT: 周恩来、毛泽东、肯尼迪; not 尼克松

NJSTAR: 周恩来, not 毛泽东; 尼克松, not 肯尼迪

 

    d. Missing words that are listable:

CAT: 单元、员工、民愤

NJSTAR: 发热、发冷、发呆 but not 发傻、发愣

 

    e. Missing gloss:  CAT: as ‘this’
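
The behavior of maximum match in the examples above can be reproduced in a few lines.  The following Python sketch is only an illustration of the general technique, not the actual code of CAT or NJSTAR; the function name `maximum_match` and the toy lexicon are hypothetical, chosen to contain the overlapping entries discussed above.

```python
def maximum_match(text, lexicon):
    """Greedy forward maximum match: at each position, take the longest
    lexicon entry that fits; fall back to a single character otherwise."""
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon (an assumption, not CAT's real dictionary).
LEXICON = {
    "法国", "法国人", "人民", "民愤", "愤怒", "怒火", "火烧", "烧酒", "酒厂",
    "这个", "个人", "人手", "手上",
}

print("+".join(maximum_match("法国人民愤怒火烧酒厂", LEXICON)))
print("+".join(maximum_match("这个人手上有伤", LEXICON)))
```

With this lexicon the sketch yields the starred segmentations 法国人+民愤+怒火+烧酒+厂 and 这个+人手+上+有+伤: once the longest entry is taken at the left edge, the algorithm never reconsiders it, which is why every later boundary is shifted.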

 

  3. Tweaks on CAT:  Due to CAT’s open nature, it is possible to improve its performance without having to know how to write Perl code.  What can be done to alleviate the problems with CAT?  The following seem possible:

 

    a. Character annotation: A drastic step to eliminate segmentation errors is to eliminate segmentation itself, i.e., to annotate each individual character instead of segmented words.  This can be achieved by inserting spaces between the characters in the input text, either manually or by using NJSTAR’s pinyin conversion options, which, along with pinyin annotations, allow the user to insert spaces of different widths for better alignment with characters.  Although we can no longer benefit from segmentation, pinyin annotation and glosses on individual characters are better than nothing and more informative than the exhaustively bopomofo-annotated 国语日报, which has been much appreciated by students.

 

    b. Pre-segmentation: A less drastic tweak is to apply selective, manual pre-segmentation to the input text.  This is made possible by CAT’s sensitivity to spaces, that is, it will not group characters separated by spaces into words.  This should not take too much time, but extreme sensitivity to potential ambiguity is required on the part of the editor.  One practical strategy is to run automatic annotation first, examine the results to spot any mistakes, and then pre-segment selectively.

 

    c. Editing the dictionary: As the dictionary used by CAT is an amalgamation of free dictionaries online, its quality leaves much to be desired.  For example, in addition to missing entries, the dictionary is full of overlapping entries such as 中国、人民、人、中国人、中国人民、银行、中国银行.  The situation is exacerbated by the lack of a uniform guideline on which longer entries should be included or excluded.  A case in point is the presence of 法国、人民、法国人 but the absence of 法国人民, in contrast with the inclusion of 中国人民.  The longer entries may also cause wrong segmentations.  When 法国人民 is the input, the wrong result 法国人+民 is inevitable given the sole use of the maximum match algorithm.  But as there are a huge number of entries, editing has to be a team effort and done with clearly laid out guidelines.
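
Because the dictionary is far too large to inspect by eye, a team editing it would probably want a mechanical way to locate risky overlaps.  The following Python sketch shows one possible audit, under the assumption that the entries are available as a plain set of strings; the function name `risky_entries` and the toy lexicon (built from the entries cited above) are hypothetical and are not part of CAT.

```python
def risky_entries(lexicon):
    """Return (longer, left, right) triples in which:
    - `longer` and `left` are both entries and `longer` starts with `left`,
    - `right` is an entry that starts with the rest of `longer`,
    - but the combination left + right is NOT itself an entry.
    Under pure maximum match, the input left + right is then mis-segmented,
    because `longer` is matched first and swallows the start of `right`."""
    risky = []
    for longer in lexicon:
        for cut in range(1, len(longer)):
            left, tail = longer[:cut], longer[cut:]
            if left not in lexicon:
                continue
            for right in lexicon:
                if (len(right) > len(tail)
                        and right.startswith(tail)
                        and left + right not in lexicon):
                    risky.append((longer, left, right))
    return sorted(risky)


# Toy lexicon made of the entries mentioned in the text (an assumption,
# not the real CAT dictionary).
LEXICON = {"中国", "人民", "人", "中国人", "中国人民", "银行", "中国银行",
           "法国", "法国人"}

for longer, left, right in risky_entries(LEXICON):
    print(f"{longer} blocks {left}+{right}; consider adding {left + right}")
```

On this toy lexicon the audit flags only 法国人 (blocking 法国+人民), and not 中国人, since 中国人民 is listed.  Such a report could give the editing team a concrete worklist, although deciding whether to add the longer compound or drop the shorter overlapping entry still requires the guidelines mentioned above.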

 

  4. Drastic overhaul?  Many of the problems outlined above stem from the simple ‘maximum match’ algorithm used.  The following are needed for a better scheme:

 

    1. Linguistic knowledge
    2. Contextual information
    3. Statistical information
    4. Algorithm for unlisted words

 

Obviously, doing this is beyond the means of the average language teaching professional.
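
To make the role of statistical information concrete, the following Python sketch scores every possible segmentation with word frequencies and keeps the most probable one, using dynamic programming.  It is a deliberately simple unigram model, not the method of any tool discussed in this paper; the function name `best_segmentation` is hypothetical, and the counts are invented purely for illustration rather than measured from any corpus.

```python
import math


def best_segmentation(text, counts, unknown_count=0.5):
    """Pick the segmentation whose words have the highest product of
    relative frequencies (unigram model), via dynamic programming."""
    total = sum(counts.values())
    max_len = max(len(word) for word in counts)
    # best[i] = (log-probability, start of last word) for text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in counts:
                count = counts[word]
            elif len(word) == 1:
                count = unknown_count  # crude smoothing for unlisted characters
            else:
                continue
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start positions to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Invented counts for illustration only (NOT corpus statistics).
COUNTS = {"法国": 50, "法国人": 10, "人民": 80, "民愤": 2, "愤怒": 30,
          "怒火": 5, "火烧": 8, "烧酒": 4, "酒厂": 6}

print("+".join(best_segmentation("法国人民愤怒火烧酒厂", COUNTS)))
```

With these made-up counts the sketch returns 法国+人民+愤怒+火烧+酒厂, because the whole sentence is scored at once instead of committing to the longest match at the left edge.  It still does nothing for unlisted names such as 王国安, and it needs reliable frequency data from a large segmented corpus, which is part of what puts a real overhaul out of reach.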

 

IV.  Re-purposing of Natural Language Processing Tools

 

A more effective and efficient strategy than tweaking and radical redesign is ‘re-purposing’, that is, using tools that were originally intended for natural language processing.  Judging from the participants in the First International Chinese Word Segmentation Bakeoff, which took place on April 22-25, 2003, sponsored by SIGHAN (Special Interest Group on Chinese Language Processing) of ACL (Association for Computational Linguistics), many natural language processing tools are being developed by research and private institutions in mainland China, Taiwan, Hong Kong, Singapore, Japan, the United States and elsewhere.  In this section, we will examine two freely downloadable auto-tagging programs, which can potentially be used to replace the weaker, segmentation portion of CAT. 

 

  1. Original purpose: These two programs were originally developed to provide part-of-speech (POS) tagging, which is a prerequisite to parsing, machine translation, information retrieval and extraction.  There is no annotation function in these two programs.  Segmentation is only a prerequisite and a by-product.

 

  2. Up-to-date and ‘state of the art’: Both represent the latest efforts, as both took part in the First International Chinese Word Segmentation Bakeoff.  The percentage of correct segmentation is around 90%.

 

  3. WinAT (Windows-based auto-tagger) from CKIP (Chinese Knowledge Information Processing Group).  It was developed by 陈克健 et al. at Academia Sinica in Taiwan.

 

The program can be freely downloaded from:

http://rocling.iis.sinica.edu.tw/CKIP/ws/.

Help is available at: http://godel.iis.sinica.edu.tw/CKIP/ws/help.htm. 

An introduction by the developers to the algorithms used can be found at

http://216.239.39.104/search?q=cache:gUJa283l92AJ:acl.ldc.upenn.edu/W/W03/W03-1726.pdf+ckip+segmentation+bakeoff&hl=en.

 

The program has a standard Windows-style interface and a clearly written help file.  On the main screen, there are two windows.  Input text can be loaded into the top window from a Big5 text file or pasted in.  Output is given in the bottom window.  Output options include the presence or absence of part-of-speech tags, line numbers and dividers.  For the purpose of using the output to feed CAT, it is best to set the options to no tags, no line numbers and no dividers.  Output can be saved as a text file or copied from the output window and pasted into another program.  If the operating system is not a Chinese-language version, then a Chinese text display program such as NJWIN (set to Big5 coding, with the Language Bar disabled if Windows XP is used) has to be used to see the characters.  Following is a screenshot:

 

 

 

 

4. ICTCLAS (Institute of Computing Technology Chinese Lexical Analysis System).

It was developed by 张华平 and 刘群 at the Chinese Academy of Sciences.  It can be downloaded for free at http://www.nlp.org.cn/, and pages detailing its operation can be found at http://pipy_world.y365.com/Manual/ICTCLAS_API.htm and www.nlp.org.cn/project/project.php?proj_id=6.  In addition to the PC version, there is also an online trial version at: http://mtgroup.ict.ac.cn/~zhp/ICTCLAS.htm.  Because the results of the online version are harder to use as input to CAT than those of the PC version, we will not discuss it any further.

Although the operation of ICTCLAS is explained in a separate file, the interface is not as user-friendly as that of WinAT.  As in WinAT, there are two windows containing input and output respectively.  As with WinAT, if the operating system is not a Chinese-language version, then a Chinese text display program such as NJWIN (set to GB coding, with the Language Bar disabled if Windows XP is used) has to be used to see the characters.  The interface is not of the standard Windows type, though.  The buttons and choices between the two windows, which are not clearly marked, provide output options, i.e., tags versus no tags, more detailed tags versus less detailed ones, output display formats and so on.  There are also functions not found in WinAT.  One button allows unattended processing of text files.  The small leftmost window right above the output window allows the user to choose how many alternative results to display in the case of ambiguous segmentation and tagging.  The other two windows above the output window show the time taken to process the input and the accuracy attained.  Following is a screenshot of ICTCLAS, with the options set at no tags, only one result displayed:

 

5. Comparison between WinAT and ICTCLAS:

 

            Although less user-friendly, ICTCLAS is better than WinAT because:

a.       ICTCLAS provides more output options, especially the capability of providing multiple versions when there are ambiguous segmentations.

b.      It deals with ambiguous segmentation much better, for example 美国会.  Not only does it make the right decision for the context, it also offers the option of displaying the alternative results. 

c.       It deals with unlisted items, such as names, much better, for example 王国安.

d.      It does not suffer from the mis-segmentation due to the overlapping of longer and shorter entries, such as the case of 法国人民.

 

To illustrate, the same test sentences, listed below, are fed first to WinAT and then to ICTCLAS.  The items marked with * are potential wrong segmentations.

这个人手上有伤。(*个人、人手)

美国会议员工作主要是非难。(*美国、会议、员工、作主、要是、是非)

克林顿对内塔尼亚胡说:(*对内、胡说;内塔尼亚胡:Netanyahu)

法国人民愤怒火烧酒厂。(*国人、民愤、怒火、烧酒)

王国安总经理代表东方进出口公司对美方表示欢迎。(*王国)

As can be seen, ICTCLAS segments the sentences exactly as a native speaker would.  WinAT, on the other hand, is led down the garden path by the potentially ambiguous strings in most cases.  The choice therefore is quite clear: ICTCLAS provides better segmentation results.
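
A comparison of this kind can be partly mechanized by checking a segmenter’s output against the starred items.  The Python sketch below is a hypothetical helper, not a feature of WinAT or ICTCLAS; it assumes the output is a string of words separated by spaces (as when tags are turned off), and the two sample outputs are made up for illustration rather than copied from either program.

```python
# Test sentences and their starred trap words, as listed above.
TEST_CASES = {
    "这个人手上有伤。": ["个人", "人手"],
    "美国会议员工作主要是非难。": ["美国", "会议", "员工", "作主", "要是", "是非"],
    "克林顿对内塔尼亚胡说:": ["对内", "胡说"],
    "法国人民愤怒火烧酒厂。": ["国人", "民愤", "怒火", "烧酒"],
    "王国安总经理代表东方进出口公司对美方表示欢迎。": ["王国"],
}


def traps_hit(segmented, traps):
    """Return the trap words that occur as whole words in a
    space-separated segmentation."""
    tokens = segmented.split()
    return [trap for trap in traps if trap in tokens]


sentence = "这个人手上有伤。"
# Made-up outputs for illustration (NOT actual program output).
good_output = "这 个 人 手上 有 伤 。"
bad_output = "这个 人手 上 有 伤 。"

print(traps_hit(good_output, TEST_CASES[sentence]))
print(traps_hit(bad_output, TEST_CASES[sentence]))
```

The check only compares whole words, so a trap that is absorbed into a longer wrong word (a segmenter producing 法国人, for instance, would not be flagged for 国人) still has to be caught by eye or through the other starred items in the same sentence.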

 

V. Re-combination of resources

 

            Given the superior segmentation performance of ICTCLAS and the unique annotation function of CAT, the logical step to take is to use the segmentation result of ICTCLAS to feed CAT for annotation, in the following manner:

 

Input text → ICTCLAS → segmented text → CAT → annotated text

 

The output text from ICTCLAS is copied and pasted into the input window of CAT.  As CAT is sensitive to spaces and therefore will not regroup characters separated by spaces, the segmentation portion of CAT applies vacuously in most cases.  Hence only the annotation portion of the program really applies to the input. 

To show the improvement obtained with the combination of ICTCLAS + CAT, we can process the same test sentences with the CAT-only method and the ICTCLAS + CAT method.  First the CAT-only method:
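
The reason the combination works can be sketched in Python as follows.  This is a minimal model of a space-sensitive, maximum-match annotator, written only to illustrate the idea; it is not CAT’s code, the names `maximum_match` and `annotate` and the toy glossary are hypothetical, and the pre-segmented string merely stands in for ICTCLAS output.

```python
# Hypothetical toy glossary standing in for the annotator's dictionary.
GLOSSES = {
    "法国": "France", "法国人": "French person", "人民": "the people",
    "民愤": "popular indignation", "愤怒": "angry", "怒火": "rage",
    "火烧": "to set on fire", "烧酒": "liquor", "酒厂": "distillery",
}


def maximum_match(text, lexicon):
    """Greedy forward maximum match with single-character fallback."""
    max_len = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def annotate(text, glosses):
    """Segment each space-delimited chunk separately, so that existing
    spaces are never crossed, then look up a gloss for every word."""
    result = []
    for chunk in text.split():
        for word in maximum_match(chunk, glosses):
            result.append((word, glosses.get(word)))  # None = no entry
    return result


raw = "法国人民愤怒火烧酒厂"
presegmented = "法国 人民 愤怒 火烧 酒厂"  # stands in for ICTCLAS output

print(annotate(raw, GLOSSES))
print(annotate(presegmented, GLOSSES))
```

On the raw string the annotator’s own maximum match produces 法国人, 民愤, 怒火, 烧酒 and an unglossed 厂, whereas on the pre-segmented string each chunk is already a dictionary word, so segmentation has nothing left to decide and only the lookup does any work.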

 

 

As can be seen, CAT used by itself fares rather poorly with these test sentences. 

In contrast, the ICTCLAS + CAT method yields very good results:

 

 

As can be seen, there are no segmentation mistakes at all.  The only remaining flaw is that three two-syllable strings remain un-annotated (shown without underline) because CAT’s dictionary lacks the corresponding entries.

 

VI.  In Conclusion

 

Automatic and exhaustive annotation is extremely useful pedagogically.  However, the two existing annotators are quite flawed, especially in segmentation.  For segmentation, we can instead use the much better automatic tagging programs developed for natural language processing.  By combining the segmentation feature of the taggers with the annotation function of an annotator like CAT, we achieve much better results with minimal development effort.

 

References:

 

Sproat, R. and Thomas Emerson (2003) The First International Chinese Word Segmentation Bakeoff.

 

Zhang, Hua-ping et al.  (2003) Chinese Lexical Analysis Using Hierarchical Hidden Markov Model, Proceedings of the Second SIGHAN Workshop, affiliated with the 41st ACL.