BerbaTek: Language technologies for Basque on the move
2012/02/01 Leturia Azkarate, Igor - Computer scientist and researcher, Elhuyar Hizkuntza eta Teknologia. Source: Elhuyar magazine
If you have followed this "Digital World" section over the last three years, you will be convinced that language technologies are going to matter more and more in a world of mobile, always-connected devices. We have told you about the semantic web and semantic technologies, machine translation and corpora, question answering systems, dialogue agents, intelligent search engines... all of which have a significant and growing presence in this new world. These technologies still have a long way to go, but in some cases they are already advanced enough to be useful and are built into many devices and services, as we have reported here.
In general, however, they are available only for the most widely spoken languages (often only for English); large companies have no interest in adding Basque to them, and even if they did, they would not be willing to bear the cost of adapting these technologies to Basque. Nor is that adaptation merely routine work: it sometimes requires basic research, the development of basic resources...
That is why the Elhuyar Foundation, the IXA and Aholab research groups of the University of the Basque Country (UPV/EHU), and the technology centres Vicomtech-IK4 and Tecnalia worked together in the BerbaTek project between 2009 and 2011, researching language, speech and multimedia technologies, mainly for Basque. The Basque Government's departments of Industry and Culture financed part of the project's budget through the Etortek program.
It is not the first time these five organizations have collaborated on language technology research. We worked together earlier on the Hizking XXI project in 2002-2004 and on the AnHitz project in 2006-2008. At the end of the latter we built a demo of a virtual science expert, also called AnHitz: a 3D avatar with spoken interaction, capable of answering science questions and running multilingual searches.
In the BerbaTek project we have done a great deal of basic research: we have developed or improved many basic resources and tools (text and speech corpora, lexicons, dictionaries, ontologies, computational grammars, morphosyntactic analyzers, speech recognition, speech synthesis, dialogue systems...), and we have worked on a range of technologies (machine translation, information retrieval, information extraction, writing aids, question answering systems...). The technologies developed in the project have been used in various projects and services.
At the service of the language industry
Although BerbaTek is a research project, putting that research to practical use has been one of our main objectives from the start, and we wanted that practical use to be in the language industry.
The language industry is understood to comprise three subsectors: translation (translation, localization, interpreting, dubbing...), content (publishers, media...) and teaching (language teaching, formal education...). In the Basque Country, the first steps have recently been taken to structure the sector: Langune, the Association of Language Industry Companies of the Basque Country, was created in 2010 with more than 30 members. The members of BerbaTek have been actively involved since its inception, and BerbaTek aims to serve as technological support for the industry and the association.
Many of the technologies developed in BerbaTek apply directly to one of the three subsectors of the language industry, while other tools, resources and technologies are applicable in any of them or form the basis for developing other technologies.
The diagram shows the language industry and its areas, and what BerbaTek can contribute to each of them and to the industry as a whole.
Demos
As already mentioned, BerbaTek aims to have practical application in the language industry. As proof, we have built a demo for each of the industry's three subsectors, each combining several technologies.
To show what language technologies can contribute in the content field, we have built a semantic multimedia search engine for science and technology. It is based on WNTerm, a specialized science and technology ontology built by Elhuyar and the IXA Group (a semantic network of science and technology concepts, with subclasses, synonyms, etc.), and on Elhuyar's content (images and texts from the magazine Elhuyar, video from the television program Teknopolis and audio from Norteko Ferrokarrilla). Using technology developed by Tecnalia, a search for a term also retrieves, through the ontology, content containing synonyms, subclasses or superclasses of that term. In addition, when a result is an image, the engine offers similar images using Vicomtech-IK4 technology.
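The ontology-based expansion described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Concept` structure, the one-entry `ONTOLOGY`, the sample `DOCUMENTS` and the substring matching are invented, and none of it reflects WNTerm's real format or the Tecnalia implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One node of a toy ontology (hypothetical structure, not WNTerm's format)."""
    label: str
    synonyms: set[str] = field(default_factory=set)
    subclasses: set[str] = field(default_factory=set)
    superclasses: set[str] = field(default_factory=set)


# Invented miniature ontology, for illustration only.
ONTOLOGY = {
    "planet": Concept(
        "planet",
        synonyms={"world"},
        subclasses={"gas giant"},
        superclasses={"celestial body"},
    ),
}

# Invented documents standing in for article texts or media descriptions.
DOCUMENTS = {
    "doc1": "Jupiter is a gas giant with dozens of moons.",
    "doc2": "A new world was found orbiting a nearby star.",
    "doc3": "The recipe needs two eggs.",
}


def expand_query(term: str) -> set[str]:
    """Return the term plus its synonyms, subclasses and superclasses."""
    concept = ONTOLOGY.get(term)
    if concept is None:
        return {term}  # unknown terms are searched as they are
    return {term} | concept.synonyms | concept.subclasses | concept.superclasses


def search(term: str) -> list[str]:
    """Return sorted ids of documents mentioning any expanded term (substring match)."""
    terms = expand_query(term)
    return sorted(
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if any(t in text.lower() for t in terms)
    )
```

Here `search("planet")` finds `doc1` and `doc2` even though neither contains the word "planet", because the query is widened to "gas giant" and "world" before matching. A real engine would match lemmas rather than raw substrings, which matters a great deal for a highly inflected language such as Basque.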
In the field of translation, we have built an automatic dubbing demo for documentaries. Automatic dubbing of films remains a hard challenge for now (many voices, colloquial language, varying speech rates...), but for some types of documentary (a single speaker, voice-over, lip synchronization unnecessary or unimportant...) we have built a demo that works well. Given a documentary in Spanish and a transcription of what is said in it (the transcription can be obtained automatically if desired, since dictation programs for Spanish are on the market), Vicomtech-IK4's time alignment technology produces a subtitle file (the transcription, with the start and end time of each sentence). The IXA Group's Matxin machine translator then translates these subtitles into Basque, and Aholab's text-to-speech technology generates a synchronized Basque voice. The demo has been applied successfully to the single-speaker sections of the Teknopolis program produced by Elhuyar.
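The middle of that pipeline (time-aligned sentences in, translated subtitles out) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Cue` class and the function names are hypothetical, the `translate` argument is a stand-in for a real translator such as Matxin (whose interface is not shown here), the SRT layout is assumed since the article does not name the subtitle format used, and the alignment and speech synthesis steps are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Cue:
    """One aligned sentence: start and end in seconds, plus its text."""
    start: float
    end: float
    text: str


def to_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def translate_cues(cues: list[Cue], translate: Callable[[str], str]) -> list[Cue]:
    """Translate each cue's text while keeping its timing unchanged."""
    return [Cue(c.start, c.end, translate(c.text)) for c in cues]


def to_srt(cues: list[Cue]) -> str:
    """Serialize cues as SRT: index, time range, text, separated by blank lines."""
    blocks = [
        f"{i}\n{to_timestamp(c.start)} --> {to_timestamp(c.end)}\n{c.text}"
        for i, c in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


# Placeholder translator: a one-entry lookup table, not a real MT system.
def toy_translate(sentence: str) -> str:
    return {"Hola.": "Kaixo."}.get(sentence, sentence)


basque_cues = translate_cues([Cue(0.0, 1.5, "Hola.")], toy_translate)
print(to_srt(basque_cues))
```

Because each translated cue keeps its original start and end times, a speech synthesizer can be handed `end - start` as the target duration for every sentence, which is what keeps the generated voice in step with the picture.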
Finally, for the teaching field, we have built a demo of a personal language tutor. The tutor is a 3D character capable of expressing emotions, developed by Vicomtech-IK4, which speaks Basque and understands spoken Basque thanks to Aholab's technology. The tutor can help us in several ways: it sets automatically generated grammar exercises (verbs, declension...) or comprehension exercises (filling the gaps in a text from several given options) using IXA technology; it assesses pronunciation thanks to Aholab technology; and it offers writing aids (verb usage, writing of numbers, dictionary lookups...) through IXA and Elhuyar technology.
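As a rough idea of what a fill-the-gaps exercise looks like as data, here is a small Python sketch. It is not IXA's method: in the demo the target word and the wrong options are chosen automatically by linguistic analysis, whereas in this hypothetical `make_gap_exercise` function they are simply passed in by hand.

```python
import random


def make_gap_exercise(sentence: str, answer: str,
                      distractors: list[str], seed: int = 0) -> dict:
    """Blank out the first occurrence of `answer` and shuffle it among the distractors."""
    if answer not in sentence:
        raise ValueError("answer must occur in the sentence")
    options = [answer, *distractors]
    random.Random(seed).shuffle(options)  # seeded so the order is reproducible
    return {
        "text": sentence.replace(answer, "_____", 1),
        "options": options,
        "answer": answer,
    }


def check(exercise: dict, choice: str) -> bool:
    """Return True when the learner picked the word that was removed."""
    return choice == exercise["answer"]


exercise = make_gap_exercise(
    "The Moon orbits the Earth.", "orbits", ["orbit", "orbiting"]
)
print(exercise["text"])     # The Moon _____ the Earth.
print(exercise["options"])  # the three candidate words, in shuffled order
```

The hard part, which this sketch skips, is generating plausible distractors; for Basque that means producing wrong verb forms or declensions that are still well-formed words.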
Dissemination of information
In the BerbaTek project, alongside basic research and practical application, we also attach importance to dissemination. It is essential for us to present our work in research forums, conferences and specialized journals, but also to show society at large the importance of language and speech technologies and to publicize what we have achieved in this field for Basque. To that end we have set up a website ( http://www.berbatek.com ) where, besides general information about the BerbaTek project, we report periodically on its progress. In addition, through the Observatory of Linguistic, Vocal and Multimedia Technologies (a search engine over other websites) we report on what is happening in the world of language technologies, and through the Events Calendar on the most important local and international events.
We are very satisfied with, and proud of, the results obtained in the BerbaTek project. But if Basque is not to fall behind in language technologies, and therefore in this new digital world, there is still a lot of work to do in the coming years. All the members of the BerbaTek project are ready to take on that challenge.