The paper "Improving Dictionary Construction by Automatic Identification of Parallel Text Pairs", which I’ve written together with Martin Hassel and Hercules Dalianis, has been accepted to the international symposium Using Corpora in Contrastive and Translation Studies (UCCST), which will be held in Hangzhou, China, 25–27 September 2008. The experiments described in the paper were carried out as part of the TvärSök project.
Abstract:
When creating dictionaries for use in e.g. cross-language search engines, parallel or comparable text pairs are needed. For instance, the website Hallå Norden, which contains information regarding mobility between the Nordic countries, has information in five languages: Swedish, Danish, Norwegian, Icelandic and Finnish. Working with these texts we discovered two main problems: the parallel corpus was very sparse, containing on average fewer than 80 000 words per language pair, and it was difficult to automatically detect parallel text pairs. Creating dictionaries with the word aligner Uplug gave on average 213 new dictionary entries. Combinations with Finnish, which belongs to a different language family, had a higher error rate, 33%, whereas the combinations of the Scandinavian languages yielded on average only 9% errors. Despite the corpus sparseness, the results were surprisingly good compared to other experiments with larger corpora.
Following this work, we carried out two sets of experiments on automatic identification of parallel text pairs. The first experiment used the frequency distribution of word-initial letters to map a text in one language to the corresponding text in another language in the JRC-Acquis corpus (European Union legal texts). Using English and Swedish as the language pair, and running a ten-fold random pairing, the algorithm made 87% correct matches (random baseline 50%). Attempting to pick the correct text from a set of one true match and 9 randomly chosen false matches yielded a success rate of 68%. In the second experiment, features such as word, sentence and paragraph frequencies were extracted from a subset of the JRC-Acquis corpus and used with memory-based learning on Swedish-Danish, Swedish-Finnish and Finnish-Danish, respectively, achieving a pair-wise success rate of 93%. We believe such methods will improve automatic bilingual dictionary construction from unstructured corpora, and our experiments will be further developed and evaluated.
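To give a rough feel for the idea behind the first experiment, here is a minimal Python sketch, not the code used in the paper. It builds a relative frequency profile of word-initial characters for each text and picks the candidate whose profile is most similar to the source text. The function names (initial_letter_profile, cosine_similarity, best_match) are made up for this illustration, and the use of cosine similarity as the comparison measure is an assumption; the abstract does not say how the distributions were compared.

```python
import math
import re
from collections import Counter


def initial_letter_profile(text):
    """Return the relative frequency of each word-initial character.

    Tokenisation with a simple regex is an assumption made for this sketch.
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter(word[0] for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (dicts)."""
    dot = sum(value * q.get(ch, 0.0) for ch, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def best_match(source_text, candidates):
    """Return the index of the candidate text most similar to source_text."""
    source = initial_letter_profile(source_text)
    scores = [
        cosine_similarity(source, initial_letter_profile(candidate))
        for candidate in candidates
    ]
    return max(range(len(candidates)), key=scores.__getitem__)


if __name__ == "__main__":
    english = "Article 1 regulation committee"
    swedish_candidates = [
        "zzz zebra zoo",                 # unrelated filler text
        "Artikel 1 reglering kommitté",  # toy stand-in for a true translation
    ]
    print(best_match(english, swedish_candidates))
```

With one true and one false candidate this corresponds to the pairing setup with a 50% random baseline; passing ten candidates corresponds to the one-in-ten setup. The toy strings above are invented and far too short to say anything about real accuracy.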
The full paper will be completed this summer.