Talking about the language of machines. A conversation among experts
2009/11/01 Roa Zubia, Guillermo - Elhuyar Zientzia. Source: Elhuyar aldizkaria
Eneko Agirre: I think the challenges are related to understanding. The research carried out in recent years has brought a great qualitative leap, but that does not mean that the machine now "understands" us. I think small steps have been taken and machines understand things in more and more areas. What a place is, for example. With surnames there is always a problem: is Azpeitia a person or a place? Or a company? Beginning to understand these things is a step forward. And although such things seem very simple to people, without context they are difficult. The challenge, therefore, is to teach the machine fragments of this type of knowledge.
In fact, the corpus-based mathematical and statistical methods are, in a way, reaching their limits: they have done what they could do and have difficulty advancing. The rule-based methods also gave what they could and got somewhat stuck. Therefore, I believe the challenge now is to learn rules from texts and then check them against the corpus in some way, to know what has been learned well and what badly.
Kepa Sarasola: To see what challenges we face today, two levels can be distinguished: that of applications and that of the inner workings of language, the basic tools that are then used in applications. It can be said that lexical needs are now almost 100% covered. Twenty years ago there were no computer dictionaries; all of them were on paper. Now you have on the Internet the meaning of every word, how it is said in other languages, etc. In morphology, for difficult languages (such as Basque), 95-98% is covered. In syntax, 90% is done well for English.
So, where do we go next? To semantics and pragmatics. And here there has been a tremendous change. Twenty years ago we had no resources to draw on for talking about any given topic. Today, for example, we have Wikipedia, WordNet, the Internet itself, etc. Now we have new resources for understanding the meaning of texts. That has opened a door for us, but not much work has been done on it yet.
Iñaki Alegria: The congress had guest speakers who reflected on the subject. For example, Joakim Nivre, the syntax expert at the University of Uppsala, pointed out that the syntax problem is not 100% solved, but that a great deal of work has been done on it. Moving on to semantics, Eneko presented the situation he has just referred to. The KYOTO project was also presented, a system that allows the meanings of words and terms to be defined through a wiki platform. Extracting knowledge from data was also discussed. And in his talk, Horacio Rodríguez, of the Polytechnic University of Catalonia, pointed out that we have to try to take up again some of the challenges of classical artificial intelligence, but with more data and new approaches. And I am somewhat of that opinion too.
On this path, Google has obtained very good results using some basic artificial intelligence methods. But unless deeper knowledge is used, little innovation will emerge in the short term.
I. A.: I think Google is innovating by taking advantage of what has already been done. It invests a lot, makes good use of it, has gained fame and has made a name for itself. This knowledge and these tools could be integrated into applications for the general public and at the industrial level. But they do not provide enough information and the demand for applications is lower than expected.
E. A.: In research you never know who will come up with the good idea. Even with a large research team, good ideas may not come out of it; it cannot be predicted. For this reason, large companies such as Google, in addition to developing their own projects, sign up successful researchers.
Many people have gone to Google. In the United States it has been said that the best researchers have gone to Google. Many young people have been hired, and that has been noticed in the universities. People have gone there and then said that at Google not everything is so nice, but very few have made a name for themselves there.
I. A.: In this area, the applications that make money are well identified. Killer applications. Historically, three types of applications have been included in this group: machine translation, proofing tools (that is, tools for text editors, mainly spelling checkers) and search. Indeed, Google started out in the world of search. It is now working on machine translation, and lately also on phone operating systems and proofing tools. In a way, the risk may be that Google monopolizes all this research.
K. S.: On the one hand, we are happy because it is clear that the techniques we work on are useful. It is shown again and again. But, on the other hand, we are concerned that Google is the only one that has the data. They know what people ask for and what they search for. And what people choose from the search results. That is very important for improving the system. If, when searching for a word, most people click on the fourth option, soon afterwards that fourth option will be ranked first. These usage data are very important, but they are owned by Google.
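The use of click data mentioned here can be illustrated with a very small sketch. It assumes, purely for illustration, that results are reordered by how often users clicked them; the function name `rerank_by_clicks` and the page names are hypothetical, and nothing here describes how Google actually ranks results.

```python
from collections import Counter


def rerank_by_clicks(results, clicks):
    """Return the results ordered by click count, most clicked first.

    Toy assumption: ranking depends only on click counts.
    """
    counts = Counter(clicks)
    # sorted() is stable, so results with equal click counts
    # keep their original relative order.
    return sorted(results, key=lambda r: -counts[r])


# Hypothetical result list for one query, in its original order.
results = ["page_a", "page_b", "page_c", "page_d"]
# Hypothetical click log: most users chose the fourth option.
clicks = ["page_d", "page_d", "page_a", "page_d"]

reranked = rerank_by_clicks(results, clicks)
print(reranked)  # ['page_d', 'page_a', 'page_b', 'page_c']
```

In this toy example the fourth option moves to the top because it received the most clicks, which is why such usage data are valuable to whoever holds them.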
E. A.: Google knows that innovation is the way forward. They direct all their energy to innovation.
I. A.: And they give priority to money; that is what counts there. And that has some consequences. For example, Google searches very badly in Basque. And they have been told so. But they are not interested. At a certain point they decided to work with a maximum of forty languages. For the rest they do a literal search. That is a problem, but the brand has a lot of strength. It is also integrated into many applications, etc. But today the Elebila application searches much better in Basque.
I. A.: English is the reference. For example, a researcher from Ethiopia came to the congress. There they speak in their mother tongue. It is a Semitic language and requires a different type of keyboard, but since mobile phones have no such keyboards, messages are sent only in English.
It is clear that Basque is a small language. From an economic point of view, demand is low, so there are problems. At the research level, we are satisfied. In some areas, at least, we are a reference for other minority languages. Corpus-based applications require investment to build their own corpora.
E. A.: As a language, Basque has its own typology, but it is not especially difficult to process computationally compared with other languages. Although its morphology is harder to handle, other areas, such as phonetics, are very easy. Each language has its difficult and simple aspects, but in general, taking all the characteristics of the language into account, the difficulty of all languages is similar.
And to compare with other languages, you have to look at each language according to its number of speakers. I believe that Basque is quite close to the most widely spoken languages. The most significant difference is the small size of the corpora used, which I think is the main shortcoming for Basque. For English, for example, there are corpora of billions of words. And machines learn from large corpora. But, relative to the resources, we are at the top of the list.
K. S.: By number of speakers, I saw Basque in position 256 on the list, and in research we are among the top 50. Why? Because there has been public funding, and I think those of us here have done things in an orderly and planned way. The tools and resources you create at a given time remain valuable in the future. We work incrementally.
The IXA group works on the processing of the Basque language. They are not the only ones. But they are leading researchers in the effort to make a robot speak Basque. If large companies, for example, wanted to develop applications in Basque, they would probably have to go to them. Among other things, they have participated in the development of the ANHITZ project, creating a virtual character that answers scientific questions. In short, a robot that speaks. It is a good example of language processing: seen from the outside, ANHITZ does not seem like a revolutionary application, since it does not respond as quickly and easily as a fictional robot. By contrast, anyone who knows the work behind the project rates it very positively. There is much left to do in language processing, no doubt. But what has been done is a huge amount of work; there is no doubt about that either.