Last week, Jörg Tiedemann gave an IS-seminar on “Machine Translation for Under-Resourced Languages and Domains”. Jörg Tiedemann is a visiting professor at the Department of Linguistics and Philology at Uppsala University, doing research on parallel corpora and machine translation. He is probably best known at DSV for Uplug, a collection of tools for processing parallel corpora, which has been used in several DSV student thesis projects.
The main topic of the seminar was how a closely related language can be of use for an under-resourced language, or for an under-resourced domain of a language. For statistical machine translation between two languages, you need parallel corpora, that is, original texts and their translations in these two languages. Such corpora do not always exist, especially not in the domain you need, and in those cases you can use an intermediate language. There are, for instance, many legal texts that exist in both an English and a Danish version (since both are official EU languages), whereas there are not that many for the language pair English/Norwegian. Standard machine translation techniques, using word alignment, can be used for constructing a machine translation system between English and Danish. To translate between Danish and Norwegian, a newer method based on character alignment is used instead, which is feasible because the two languages are so closely related. Thereby, Danish can be used as an intermediate language for creating automatic translations between English and Norwegian.
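The idea of translating through an intermediate language can be sketched in a few lines of Python. This is only a toy illustration, not the system presented at the seminar: the word table `EN_DA`, the character mapping `DA_NO_CHARS`, and all function names are hypothetical, and a simple lookup table and a single character rule stand in for the trained word-alignment and character-alignment models.

```python
# Toy English -> Danish word table. Hypothetical stand-in for a
# word-alignment-based statistical MT system trained on parallel text.
EN_DA = {"a": "en", "book": "bog", "week": "uge", "cake": "kage"}

# Toy Danish -> Norwegian character mapping. Hypothetical stand-in for a
# character-alignment-based model. A real model learns context-dependent
# mappings; this rule rewrites every "g" and would be wrong for many words.
DA_NO_CHARS = str.maketrans({"g": "k"})


def translate_en_da(sentence: str) -> str:
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(EN_DA.get(word, word) for word in sentence.lower().split())


def translate_da_no(sentence: str) -> str:
    """Rewrite the Danish text character by character."""
    return sentence.translate(DA_NO_CHARS)


def pivot_translate(sentence: str) -> str:
    """English -> Norwegian, using Danish as the intermediate language."""
    return translate_da_no(translate_en_da(sentence))


print(pivot_translate("a book"))  # en bok
print(pivot_translate("a cake"))  # en kake
```

The point of the sketch is the composition in `pivot_translate`: the English/Danish step only needs English/Danish data, and the Danish/Norwegian step can work at the character level because the two languages differ mostly in spelling.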