First Online Version of the Corpus of Science and Technology
2006/12/01 Gurrutxaga Hernaiz, Antton - Elhuyar Hizkuntza Zerbitzuak. Source: Elhuyar aldizkaria
The corpus draws on science and technology works published between 1990 and 2002. It is classified by field (area of knowledge) and genre.
The corpus is annotated both for text structure and format and at the linguistic level. The linguistic annotation was carried out with advanced automatic processing technology for the Basque language (the Eustagger tagger of the IXA group). Each word of the text is tagged with its lemma and its category/subcategory. This version of the corpus contains 8 million words, of which 1.6 million have been manually revised, disambiguated and corrected. The corpus is encoded in XML and follows the TEI standard.
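As a rough illustration of what word-level annotation with lemma and category can look like, the following Python sketch reads a small TEI-style fragment. The fragment, the use of the `lemma` and `type` attributes on `<w>`, the category values and the function name `read_tokens` are all assumptions made for this example; the actual markup of the corpus is not described here and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style sentence; element and attribute choices are
# illustrative assumptions, not the corpus's real schema.
SAMPLE = """
<s>
  <w lemma="zelula" type="NOUN">zelulak</w>
  <w lemma="ur" type="NOUN">ura</w>
  <w lemma="behar" type="VERB">behar</w>
  <w lemma="izan" type="AUX">du</w>
</s>
"""

def read_tokens(xml_text):
    """Return (word form, lemma, category) for every <w> element."""
    root = ET.fromstring(xml_text.strip())
    return [(w.text, w.get("lemma"), w.get("type")) for w in root.iter("w")]

tokens = read_tokens(SAMPLE)
for form, lemma, category in tokens:
    print(f"{form}\t{lemma}\t{category}")
```

Keeping the word form, lemma and category together per token is what makes it possible to search by any of the three.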
The corpus has a powerful query interface in which the user can perform simple and complex searches of all kinds using a wide set of parameters: lemma, word form, category, field, genre, corpus section (manually corrected/complete corpus...). Results are of two types: on the one hand, short contexts (KWIC) and extended contexts of the search term; on the other, quantitative information presented in tables and graphs (frequencies, publications, distribution by res, etc.).
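To make the idea of a KWIC (keyword in context) result concrete, here is a minimal Python sketch of a lemma search over a list of tagged tokens. It is not the interface's implementation: the token list, the function name `kwic` and the `width` parameter are hypothetical and chosen only for this example.

```python
# Hypothetical (word form, lemma) pairs standing in for tagged corpus text.
TOKENS = [
    ("zelulak", "zelula"), ("ura", "ur"), ("behar", "behar"), ("du", "izan"),
    ("eta", "eta"), ("urak", "ur"), ("zelula", "zelula"),
    ("babesten", "babestu"), ("du", "izan"),
]

def kwic(tokens, lemma, width=2):
    """Return (left context, keyword, right context) for each lemma match."""
    lines = []
    for i, (form, tok_lemma) in enumerate(tokens):
        if tok_lemma == lemma:
            left = " ".join(f for f, _ in tokens[max(0, i - width):i])
            right = " ".join(f for f, _ in tokens[i + 1:i + 1 + width])
            lines.append((left, form, right))
    return lines

results = kwic(TOKENS, "ur", width=2)
for left, keyword, right in results:
    print(f"{left:>15} [{keyword}] {right}")
```

Searching by lemma rather than by word form retrieves every inflected form of a word at once, which matters for a highly inflected language such as Basque.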
The corpus will be available at www.ztcorpusa.net. In addition, from 2007 it will be offered among the OECD resources for commercial use under license.
The texts included in this first version of the corpus were obtained in digital format from various providers, thanks to agreements signed with them. Our most sincere thanks to all of them.
The Corpus of Science and Technology project began within the Hizking21 strategic research project. The Hizking21 project has received the following grants: the Etortek Program of the Department of Industry of the Basque Government (2002-2004) and the Gipuzkoako Zientzia, Teknologia eta Berrikuntza Sarea program of the Provincial Council of Gipuzkoa (2004). In addition, the Corpus of Science and Technology has been supported by the Department of Culture of the Basque Government through the Euskara and New Technologies 2005 program.