OpenTrad, against the Tower of Babel
2006/04/01 Galarraga Aiestaran, Ana - Elhuyar Zientzia Iturria: Elhuyar aldizkaria
The Journal of Catalonia is published daily in two languages: Spanish and Catalan. It does not need twice as many employees as other newspapers to do so. Its secret is a machine translator. Journalists write the newspaper in Spanish and the machine translator then renders it in Catalan. Several proofreaders review the text, and the Catalan edition is ready to hit the streets alongside the Spanish one.
The Journal of Catalonia is a telling example of the value of machine translators. The newspaper's translator is not the only one that works from Spanish to Catalan; there are many others. For example, the University of Alicante created interNOSTRUM for the Mediterranean Savings Bank. It translates in both directions, and anyone can now use it for free on the website of the same name. It does, however, only accept texts of up to 16,384 characters.
There is also a machine translator from Galician to Spanish in the Spanish state, but it is a very closed and limited product. And what about Basque? So far, very little. The IXA group of the UPV/EHU Faculty of Computer Science was developing an automatic system for translating English into Basque, but progress was slower than they wanted.
That was the situation two or three years ago. In 2004, however, the OpenTrad development project was launched. The IXA group already knew the researchers who had developed interNOSTRUM, and Eleka Linguistic Engineering and IXA were already working together. They joined forces with groups doing similar work in Galicia and, thanks to a grant from the Ministry of Industry, Tourism and Commerce, set out to create an open-source machine translator.
According to Iñaki Arantzabal of Eleka, the objectives were defined at two levels from the beginning: "On the one hand, we wanted a good, fast, open-source machine translator for the Galician-Spanish and Catalan-Spanish pairs and, on the other, a prototype to translate from Spanish into Basque. It should be noted that the starting point was not the same for all languages: the Spanish-Catalan pair was quite advanced and, at the other extreme, for automatic translation from Spanish into Basque almost everything remained to be done."
Close on the surface
Naturally, the distance between languages matters a great deal here. Spanish, Galician and Catalan are clearly much closer to one another than any of them is to Basque. Consequently, it is much easier to build a good translation system between Romance languages than one that involves Basque.
OpenTrad therefore has two machine translation engines: Apertium, for translation between Romance languages, and Matxin, for translation from Spanish into Basque.
Both are based on linguistic rules. There are several approaches to machine translation, but the main ones are those based on collections of previously translated texts, that is, corpora, and those based on linguistic rules (word order in the sentence, declension, verbs and so on).
Iñaki Alegría, from IXA, explained that "systems based on linguistic rules work in three phases. First they perform a morphological and syntactic analysis of the original text, then they transfer it to the other language, and finally they generate the text in that second language."
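The three phases Alegría describes can be pictured with a toy Spanish-to-Catalan example. This is only a minimal sketch: the three tables below are hypothetical, hand-made data rather than OpenTrad resources, and marking unknown words with an asterisk is a choice made for this illustration.

```python
# Toy illustration of the three phases: analysis, transfer, generation.
# All tables are hypothetical, hand-made data, not OpenTrad resources.

# Phase 1 data: Spanish surface form -> (lemma, morphological tags)
ANALYSIS = {
    "el": ("el", "det.m.sg"),
    "los": ("el", "det.m.pl"),
    "gato": ("gato", "n.m.sg"),
    "gatos": ("gato", "n.m.pl"),
    "come": ("comer", "v.pres.3sg"),
    "pescado": ("pescado", "n.m.sg"),
}

# Phase 2 data: Spanish lemma -> Catalan lemma
TRANSFER = {"el": "el", "gato": "gat", "comer": "menjar", "pescado": "peix"}

# Phase 3 data: (Catalan lemma, tags) -> Catalan surface form
GENERATION = {
    ("el", "det.m.sg"): "el",
    ("el", "det.m.pl"): "els",
    ("gat", "n.m.sg"): "gat",
    ("gat", "n.m.pl"): "gats",
    ("menjar", "v.pres.3sg"): "menja",
    ("peix", "n.m.sg"): "peix",
}


def analyse(sentence):
    """Look up each word; unknown words keep their form and get no tags."""
    return [ANALYSIS.get(word, (word, None)) for word in sentence.lower().split()]


def transfer(analysed):
    """Replace each known source lemma with its target lemma, keeping the tags."""
    return [
        (TRANSFER[lemma], tags) if tags is not None else (lemma, None)
        for lemma, tags in analysed
    ]


def generate(transferred):
    """Inflect each target lemma; unknown words are emitted with a '*' mark."""
    words = []
    for lemma, tags in transferred:
        words.append(GENERATION[(lemma, tags)] if tags is not None else "*" + lemma)
    return " ".join(words)


def translate(sentence):
    return generate(transfer(analyse(sentence)))


print(translate("el gato come pescado"))
print(translate("los gatos"))
```

Because the word order is left untouched, a word-by-word scheme like this only works between closely related languages.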
For translation between Romance languages, good results are obtained even though the transfer is shallow. This is what interNOSTRUM does, and it was the starting point for developing the Apertium engine. In a way, Apertium is an improved, open-source version of interNOSTRUM.
What the Catalan side has gained above all is that the code is now open. In addition, OpenTrad keeps the code completely separate from the language resources. Thanks to this, the system is easy to work with and to adapt to the user's needs. It is designed to accept whatever changes users want to make to enrich and improve it.
Apertium does more than syntactic transfer. It also has several 'filters' to refine the translation. For instance, it can detect structures typical of one language and replace them with their equivalent in the other. The result is a higher-quality translation. The translator for the Spanish-Catalan pair, for example, is 95% reliable, that is, only five out of every hundred translated words are wrong.
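To make the arithmetic behind such a figure concrete, the sketch below counts the words that a proofreader had to change. It is an assumption for illustration only: the article does not say how the 95% figure was measured, and this simple version only handles texts in which the machine output and the corrected text have the same number of words.

```python
def word_accuracy(machine_output: str, corrected: str) -> float:
    """Share of words left unchanged by the proofreader (hypothetical measure)."""
    machine_words = machine_output.split()
    corrected_words = corrected.split()
    if not machine_words:
        raise ValueError("the machine output is empty")
    if len(machine_words) != len(corrected_words):
        raise ValueError("this sketch only compares texts with the same word count")
    # A word counts as wrong when the proofreader replaced it.
    wrong = sum(1 for mt, ref in zip(machine_words, corrected_words) if mt != ref)
    return 1 - wrong / len(machine_words)


# One hundred words, five of them corrected by the proofreader.
machine = " ".join(["ok"] * 95 + ["bad"] * 5)
revised = " ".join(["ok"] * 95 + ["fixed"] * 5)
print(f"{word_accuracy(machine, revised):.0%}")
```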
Far apart in depth
Apertium, however, cannot be used to translate from Spanish into Basque. The languages are so different that shallow syntactic transfer is not enough. Sentence structure also changes radically, so a different engine is needed: one that performs a deep morphological and syntactic analysis, builds a dependency tree, transfers it and generates the text in Basque. For that purpose they created Matxin.
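A rough idea of why a dependency tree helps: once the sentence is represented as a verb with its subject and object attached, the target word order and case endings can be decided from the tree rather than from the source word order. The sketch below is a hypothetical illustration, not Matxin's actual data structures or rules. The tree is built by hand (no parser is involved) and the small lexicon and inflection tables are invented for a single sentence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lemma: str
    relation: str                      # "root", "subj" or "obj"
    children: list = field(default_factory=list)


# Hypothetical toy resources covering one sentence only.
LEXICON = {"perro": "txakur", "pan": "ogi", "comer": "jan"}
TAG_FOR_RELATION = {"subj": "erg", "obj": "abs", "root": "pres.3sg"}
SURFACE = {
    ("txakur", "erg"): "txakurrak",    # subject of a transitive verb
    ("ogi", "abs"): "ogia",            # direct object
    ("jan", "pres.3sg"): "jaten du",   # verb with its auxiliary
}


def transfer(node):
    """Translate every lemma while keeping the shape of the tree."""
    return Node(LEXICON[node.lemma], node.relation,
                [transfer(child) for child in node.children])


def generate(root):
    """Linearise the tree as subject, object, verb and inflect each word."""
    by_relation = {child.relation: child for child in root.children}
    ordered = [by_relation[r] for r in ("subj", "obj") if r in by_relation]
    ordered.append(root)               # the verb goes last
    return " ".join(SURFACE[(n.lemma, TAG_FOR_RELATION[n.relation])] for n in ordered)


# Hand-built tree for Spanish "el perro come pan" (the dog eats bread).
spanish_tree = Node("comer", "root", [Node("perro", "subj"), Node("pan", "obj")])
print(generate(transfer(spanish_tree)))
```

In Spanish the verb sits between subject and object, while the generated Basque sentence places it at the end, which is the kind of reordering a word-by-word transfer cannot produce.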
The IXA group acknowledges that developing Matxin has been "hard work", and that the result is not as good as what Apertium delivers when translating between Romance languages. In any case, they have met the initial goal, which was to build the infrastructure.
Translation quality has been one of the main concerns in developing the machine translator, but the team has also worked on the speed of the system, and on that front they say they are satisfied. The system is fast enough to browse web pages in the translated language. As an example, Arantzabal points out that gipuzkoa.net, whose original is in Spanish, can be browsed in Catalan and Galician through OpenTrad.
Looking forward
So far, the project has produced a good, fast automatic system that translates in both directions for the Galician-Spanish and Catalan-Spanish pairs, as well as a prototype that translates from Spanish into Basque. In the words of the head of Eleka, "we have achieved the goal".
But they have no intention of stopping there. "We want to keep improving and completing the system. One way to improve results is to focus on specific domains. Each domain uses its own language, with fewer ambiguity problems than in general use. Quality therefore increases when the translator is adapted to a domain, for example by adding the relevant terminology." In this way they hope to improve reliability.
They also intend to complement the rule-based technology with others; specifically, they want to use parallel corpora. "In this way, when you want to translate a sentence, the system will first check whether it has already been translated or whether there is something similar. If there is, it will start from that to translate. If there is nothing similar, it will use the rule-based technology."
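The lookup order described in that quote can be sketched as follows. Everything here is an assumption made for illustration: the tiny translation memory, the 0.8 similarity threshold, the use of Python's standard `difflib.SequenceMatcher` to measure similarity, and the placeholder standing in for the rule-based engine. The article does not say how OpenTrad plans to implement this.

```python
from difflib import SequenceMatcher

# Hypothetical memory of previously translated sentences (Spanish -> Basque).
MEMORY = {
    "buenos días": "egun on",
    "muchas gracias": "eskerrik asko",
}


def rule_based_translate(sentence):
    """Placeholder for the rule-based engine; it only tags the input."""
    return "[rules] " + sentence


def translate(sentence, threshold=0.8):
    """Return (strategy, text): exact match, similar match, or rule fallback."""
    # 1. Has this exact sentence been translated before?
    if sentence in MEMORY:
        return "exact", MEMORY[sentence]
    # 2. Otherwise look for the most similar stored sentence.
    best_source, best_score = None, 0.0
    for source in MEMORY:
        score = SequenceMatcher(None, sentence, source).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is not None and best_score >= threshold:
        # A real system would adapt this candidate; here it is returned as is.
        return "similar", MEMORY[best_source]
    # 3. Nothing similar: fall back to the rules.
    return "rules", rule_based_translate(sentence)


print(translate("buenos días"))
print(translate("buenos dias"))
print(translate("el perro come pan"))
```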
Besides improving and complementing the system, they want to create a machine translator from Basque into Spanish. That would give people who do not speak Basque the chance to find out what is produced in Basque. Another future objective is to translate from English into Basque.
To make these advances, Arantzabal hopes to have the support of the Basque Government. A few years ago, in fact, the Basque Government commissioned a Catalan company to develop a machine translation prototype. Today, OpenTrad is the most advanced system in the Spanish state. That is why Arantzabal says: "We want to convince the Basque Government to back our system. We believe that, at the very least, it cannot stay on the sidelines."
· http://www.opentrad.net
· http://opentium.sourceforge.net
· http://matxin.sourceforge.net