Master thesis: Using parallel corpora and Uplug to create a Chinese-English dictionary. Defended December 10, 2008

Authors: Hao-chun Xing (EMIS) & Xin Zhang (EMIS)


This master thesis is about using parallel corpora and word alignment to
automatically create a bilingual Chinese-English dictionary. The dictionary
can contribute to multilingual information retrieval or to a domain specific

However, creating bilingual dictionaries is a difficult task, especially
for Chinese, which is very different from European languages. Unlike
European languages, Chinese has no delimiters to mark word boundaries.
We therefore needed Chinese word segmentation software to insert
boundaries between Chinese words so that they could be matched with
English words. This was one of the most difficult issues in our project,
and we spent half of our time on it.
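To illustrate what word segmentation involves, here is a minimal sketch of
greedy forward maximum matching, a simple dictionary-based technique. It is
not the method used by ICTCLAS and not taken from the thesis; the function
name segment, the max_len parameter and the toy lexicon are all assumptions
made for illustration.

```python
def segment(text, lexicon, max_len=4):
    """Split text into words by greedy forward maximum matching."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no lexicon entry starts at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, not taken from the thesis or from ICTCLAS.
TOY_LEXICON = {"中华", "人民", "共和国", "中华人民共和国", "法律"}

tokens = segment("中华人民共和国法律", TOY_LEXICON, max_len=7)
print(" ".join(tokens))
```

Greedy matching of this kind fails on ambiguous character sequences and on
words missing from the lexicon, which is one reason a dedicated segmenter is
normally preferred.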

Our parallel corpora consist of 104,563 Chinese characters (approximately
50,000-60,000 Chinese words) and 75,997 English words, mainly from law
texts. We used ICTCLAS as the Chinese word segmentation software to
pre-process the raw Chinese text, and then used the word alignment system
Uplug to process the prepared parallel corpora and create the
Chinese-English dictionary.
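As a rough illustration of the last step, the sketch below turns a list of
aligned word pairs into dictionary entries by keeping the most frequent
English translation of each Chinese word. It does not reproduce Uplug's
output format or the procedure used in the thesis; the input shape (a list
of (Chinese, English) tuples), the function name build_dictionary and the
min_count threshold are assumptions.

```python
from collections import Counter, defaultdict


def build_dictionary(links, min_count=2):
    """Keep the most frequent English translation for each Chinese word."""
    counts = defaultdict(Counter)
    for zh, en in links:
        # Lower-case the English side so "Law" and "law" count together.
        counts[zh][en.lower()] += 1

    dictionary = {}
    for zh, translations in counts.items():
        en, n = translations.most_common(1)[0]
        # Discard pairs seen fewer than min_count times as likely noise.
        if n >= min_count:
            dictionary[zh] = en
    return dictionary


# Hypothetical alignment links, invented for illustration only.
links = [
    ("法律", "law"), ("法律", "Law"), ("法律", "legal"),
    ("合同", "contract"), ("合同", "contract"),
    ("人民", "people"),
]
entries = build_dictionary(links)
print(entries)
```

A frequency threshold like this trades coverage for precision: a higher
min_count removes more spurious pairs but also drops rare, correct ones.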

Our dictionary contains 2,118 entries. We evaluated the results with the
assistance of nine native Chinese speakers. The average accuracy of the
dictionary is 74.1 percent.

Key Words: Parallel Corpora, Chinese Word Segmentation, Uplug, Word
Alignment Tool

Download master thesis
