Portal of automatically built dictionaries
2013/06/18 Leturia Azkarate, Igor - Computer scientist and researcher, Elhuyar Hizkuntza eta Teknologia. Source: Elhuyar aldizkaria
If any linguistic resource is basic, it is the dictionary. And among dictionaries, bilingual ones are indispensable in many settings: language learning, translation... In the Language Technologies section of Elhuyar we also use bilingual dictionaries for machine translation, multilingual search and so on.
However, compiling dictionaries is costly. For this reason, bilingual dictionaries for Basque are not as abundant as we would like, and the same is true of other minority languages. Dictionaries usually exist for languages in contact (other local or neighbouring languages) or for the main international languages. They are not made for other minority languages or for distant major languages, and this puts minority languages such as Basque at a disadvantage. Take, for example, the options immigrants have for learning the language: it is not easy to learn Basque directly from one's own language. Spanish, English or French must always act as a bridge, so one of them has to be learned first...
Bridges to create dictionaries
Needing a bridge language is a disadvantage when learning a language, but the same idea can be used to create new dictionaries simply and cheaply. Almost every language has some bilingual dictionary with a “big” language (usually English). We can take two such dictionaries and, using the “big” language as a bridge, build a dictionary for a new pair of languages. This technique is called pivoting, because one language serves as the pivot. Put simply, if a Basque-English dictionary contains etxe => house and an English-German dictionary contains house => Haus, we conclude that etxe => Haus. In this way we can build a Basque-German dictionary.
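The chaining step can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used at Elhuyar: the function name `pivot` and the representation of each dictionary as a mapping from a word to a set of equivalents are assumptions made for the example.

```python
from collections import defaultdict


def pivot(source_to_pivot, pivot_to_target):
    """Chain two bilingual dictionaries through a shared pivot language.

    Both arguments map a word to a set of equivalents (an assumed,
    simplified representation of a bilingual dictionary).
    """
    result = defaultdict(set)
    for source_word, pivot_words in source_to_pivot.items():
        for pivot_word in pivot_words:
            # Follow every pivot word into the second dictionary.
            for target_word in pivot_to_target.get(pivot_word, ()):
                result[source_word].add(target_word)
    return dict(result)


# The example from the text: etxe => house, house => Haus.
basque_english = {"etxe": {"house"}}
english_german = {"house": {"Haus"}}

basque_german = pivot(basque_english, english_german)
print(basque_german)  # {'etxe': {'Haus'}}
```

Source words whose pivot equivalents are missing from the second dictionary simply produce no entry.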
In recent years the Language Technologies R&D department at Elhuyar has been researching this technique, with the aim of creating new dictionaries between Basque and other languages. The previous example makes the technique look trivial, but that example is a very simple one: in reality a word can have several meanings, and each meaning can have several equivalents. As a result, simply chaining dictionaries produces many erroneous equivalences, as the example in the figure shows.
The difficulty of the technique, therefore, is that building a quality dictionary requires detecting and removing those erroneous equivalences automatically. Two methods are used for this. The first counts the number of paths between two words: the more paths, the more likely the equivalence is correct. The second measures the similarity of the contexts in which the words appear in corpora of the two languages: the more similar the contexts, the more likely the words are equivalent. Of course, measuring the similarity of contexts written in different languages itself requires a dictionary, and for that the most reliable equivalences obtained with the first method are used.
Like any automatic method in language technology, these cleaning techniques never give perfect results; there will always be some error rate. That rate varies widely, since it depends on several factors (the languages, the dictionaries used, the corpora used, etc.), but some measurements indicate that the percentage of correct results lies between 60% and 80%. Clearly these are not perfect dictionaries, but they are better than having nothing.
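Both filtering ideas can be sketched on toy data. Everything here is an assumption made for illustration: the function names, the dictionary entries and the context counts are invented, cosine similarity is only one possible way to compare contexts, and the seed dictionary is written by hand, whereas the text says it comes from the most reliable results of the first method.

```python
import math
from collections import Counter


def count_paths(source_to_pivot, pivot_to_target):
    """Count how many pivot words connect each (source, target) candidate."""
    paths = Counter()
    for source_word, pivot_words in source_to_pivot.items():
        for pivot_word in pivot_words:
            for target_word in pivot_to_target.get(pivot_word, ()):
                paths[(source_word, target_word)] += 1
    return paths


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def context_similarity(source_context, target_context, seed):
    """Translate the source context with a seed dictionary, then compare."""
    translated = Counter()
    for word, freq in source_context.items():
        for equivalent in seed.get(word, ()):
            translated[equivalent] += freq
    return cosine(translated, target_context)


# Toy entries (invented): a polysemous pivot word yields a wrong candidate.
basque_english = {"banku": {"bank", "bench"}}
english_german = {"bank": {"Bank", "Ufer"}, "bench": {"Bank"}}

paths = count_paths(basque_english, english_german)
print(paths[("banku", "Bank")], paths[("banku", "Ufer")])  # 2 1

# Toy context counts (invented) and a hand-written seed dictionary.
seed = {"diru": {"Geld"}, "kontu": {"Konto"}}
banku_context = Counter({"diru": 3, "kontu": 2})
german_contexts = {
    "Bank": Counter({"Geld": 4, "Konto": 1}),
    "Ufer": Counter({"Fluss": 5, "Wasser": 2}),
}
for candidate, context in german_contexts.items():
    score = context_similarity(banku_context, context, seed)
    print(candidate, round(score, 2))
```

In this toy case both signals agree: Bank is reached by two paths and shares context with banku, while Ufer is reached by a single path and shares none. A real system would still need thresholds, which this sketch does not attempt to choose.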
Portal of automatic dictionaries
Using the methods described above, we have created five bilingual dictionaries for Basque, pairing it with five major languages from three continents (Africa, Asia and Europe), Basque-Arabic among them. In all of them English served as the bridge language. For Basque and English we used Elhuyar's dictionary, and for English and the other languages we took five free dictionaries from the web. The resulting dictionaries are not very large: they are basic dictionaries of between 8,000 and 21,000 entries, since the dictionaries obtained from the web were themselves of roughly that size. All the dictionaries work in both directions.
All these dictionaries have been made available to the public on the Portal of Automatic Dictionaries (http://hiztegiautomatikoak.elhuyar.org). And when we say available to the public, we mean more than consultation. On the one hand, every dictionary can be downloaded in full from the portal itself (since the source dictionaries were free, we also release those derived from them). On the other hand, because the dictionaries are not perfect and contain errors, as mentioned above, the website lets users take part in correcting and improving them by marking equivalents as correct or wrong.
The website has a search field for looking up words in the dictionaries. In the results area, users can mark whether each equivalent seems correct or incorrect; alongside the word, each result shows real usage examples from the corpus, which illustrate it and help the user decide whether the result is good or bad. It is also possible to distinguish between reliable and doubtful equivalents. A download section offers the complete dictionaries in XML format. Finally, the website has a forum where users can discuss the correctness of specific words, ask questions, etc. The interface is available in 8 languages and includes a virtual keyboard for searching in languages that do not use the Latin alphabet.
We do not intend to stop here. We plan to create more dictionaries and add them to the portal. We also want collaboration to go beyond voting: for example, by letting users add or modify equivalents and examples.
With the Portal of Automatic Dictionaries we have, for the first time, connected Basque with 5 other languages. They may seem like distant languages, and perhaps they once were, but globalization and the internet are bringing them ever closer. We believe these dictionaries are an important resource, and will be more so in the future if we all help to improve them.
X. Saralegi, I. Manterola, I. San Vicente. 2011. "Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries". Conference on Empirical Methods in Natural Language Processing (EMNLP 2011). Edinburgh.
X. Saralegi, I. Manterola, I. San Vicente. 2012. "Building a Basque-Chinese Dictionary by Using English as a Pivot". 8th International Conference on Language Resources and Evaluation (LREC'12). Istanbul.