SpråkVis - Language Technology Expert Panel Report

by Krister Lindén, Kimmo Koskenniemi and Torbjørn Nordgård

Extended Summary

The Nordic Council of Ministers has commissioned an expert panel report setting out a ten-year plan for making the Nordic countries a leading region in language technology (LT). We identified six key areas: LT Policy, LT Resources, LT Research and Development, LT Training and Education, LT Legislation and LT Business Aspects. This report presents recommendations for each of them and concludes with a suggested action plan.

LT Policy

We need to raise awareness that LT plays a key role in protecting and maintaining our languages and our culture. LT is necessary, for example, for developing a digital infrastructure for research in the humanities and the social sciences. It does not matter whether LT is academic, open source or commercial, as long as it exists and its modules are compatible and available for building large systems and applications. Small language communities will not get LT on a commercial basis alone, so most (or all) languages in the area need at least some public support, and some may be totally dependent on it. At the Nordic level, we need to establish recommendations for actions at the national level. To assess the situation for language-specific and language-independent resources for the languages in the area, a Basic Language Resource Kit (BLARK) report for the Nordic languages should be prepared. The Nordic region needs to stay abreast of developments in the EU in order not to duplicate efforts and in order to focus on the aspects that are specifically Nordic. The participants of NODALIDA 2005 decided to establish an association for speech and language technology, to be called NEALT (Northern European Association for Language Technology). Such an association would be ideal for coordinating various initiatives and networks.

Action areas where Nordic rather than national funding is needed are:

  • establishing and starting NEALT and establishing a scientific electronic journal by NEALT,
  • some form of continuation for the Nordic LT documentation centres (see awareness under LT Training and Education),
  • some continuity for the Nordic Graduate School of Language Technology (NGSLT) through NordForsk (see LT Training and Education), and
  • individual small-scale projects (possibly carried out and coordinated by NEALT), e.g. to prepare more detailed recommendations for
    • altering the legislation on intellectual property rights (IPR, see LT Legislation),
    • guidelines for funding agencies to guarantee access and reuse of LT resources created with public funding (see LT Research and Development), and
    • guidelines for research and/or commercial use of dictionaries and word lists created as part of publicly funded dictionary compilation (see LT Resources).

Key Area | Magnitude of funding needed | Parties involved | Mode of cooperation
NEALT start-up | 50 kEUR | NMR for funding | association, working groups
BLARK Report | 10-25 kEUR per language | NorDokNet, NEALT | national projects coordinated at the Nordic level

LT Resources

The most obvious and substantial investment would be to create an infrastructure with sufficient LT resources for the relevant languages of the area. The resources belonging to the infrastructure should be freely available for research and training as well as for commercial product development. Based on the assessment of the situation in the BLARK report, the most urgent gaps in the availability of corpora should be filled using national funding, with cooperation at the Nordic level for developing and exchanging language-independent tools and methods.

LT modules

Both commercially and academically created LT modules need to be compatible and able to reuse other modules and resources. Language-independent tools can be used for creating both kinds of modules, and common APIs make it possible to combine modules into interoperable and multilingual products and systems.
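
To illustrate what a common API could mean in practice, the following is a minimal sketch and not a proposal from the panel. It assumes a hypothetical shared interface in which every LT module maps a list of tokens to a list of tokens; the names Module, tokenize, lowercase, strip_punctuation and compose are invented for this example.

```python
from typing import Callable, List

# Hypothetical common interface (an assumption for this sketch):
# every module takes a list of tokens and returns a list of tokens.
Module = Callable[[List[str]], List[str]]


def tokenize(text: str) -> List[str]:
    """Split raw text on whitespace to produce the initial token list."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Toy module: lowercase every token."""
    return [t.lower() for t in tokens]


def strip_punctuation(tokens: List[str]) -> List[str]:
    """Toy module: trim leading/trailing punctuation and drop empty tokens."""
    stripped = (t.strip(".,;:!?") for t in tokens)
    return [t for t in stripped if t]


def compose(*modules: Module) -> Module:
    """Chain modules that share the interface into a single module."""
    def pipeline(tokens: List[str]) -> List[str]:
        for module in modules:
            tokens = module(tokens)
        return tokens
    return pipeline


normalize = compose(lowercase, strip_punctuation)
result = normalize(tokenize("Hej, Norden!"))
```

Because the composed pipeline has the same signature as its parts, a module from one provider could in principle be swapped for another without changing the surrounding system.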

Key Area | Magnitude of funding needed | Parties involved | Mode of cooperation
Openly available LT modules and common APIs | 2-5 MEUR | open source community, universities, public and private institutions, NEALT | Nordic LT network

LT tools

Freely usable, language-independent, state-of-the-art tools are needed so that investments in LT modules are not lost in the long term. Interoperable components and multilingual products and systems can be achieved through such tools. For example, finite-state technology provides very efficient and modular implementations for a number of tasks.
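
As a rough illustration of the finite-state approach, the sketch below hand-codes a tiny deterministic transducer with final outputs that analyses two forms of the Swedish noun "bil". The transition table, the tag names (+N, +Sg, +Pl) and the function analyze are assumptions made up for this example; real systems would compile such machines from lexicons and rules with dedicated finite-state tools.

```python
from typing import Dict, Optional, Tuple

# (state, input symbol) -> (next state, output string).
# Toy data for this sketch only: accepts "bil" and "bilar".
TRANSITIONS: Dict[Tuple[int, str], Tuple[int, str]] = {
    (0, "b"): (1, "b"),
    (1, "i"): (2, "i"),
    (2, "l"): (3, "l"),
    (3, "a"): (4, ""),  # the plural ending is consumed without output
    (4, "r"): (5, ""),
}

# Output appended when the input ends in an accepting state.
FINAL_OUTPUT: Dict[int, str] = {3: "+N+Sg", 5: "+N+Pl"}


def analyze(word: str) -> Optional[str]:
    """Run the transducer over the word; return the analysis or None."""
    state = 0
    output = []
    for symbol in word:
        step = TRANSITIONS.get((state, symbol))
        if step is None:
            return None  # no transition: the word is not in the toy lexicon
        state, emitted = step
        output.append(emitted)
    if state not in FINAL_OUTPUT:
        return None  # input ended in a non-accepting state
    return "".join(output) + FINAL_OUTPUT[state]
```

Each input symbol costs one table lookup, so analysis time grows linearly with word length regardless of lexicon size, which is one reason the technology is considered efficient.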

Key Area | Magnitude of funding needed | Parties involved | Mode of cooperation
Openly available LT tools | 2-5 MEUR | open source community, universities, public and private institutions, NEALT | Nordic LT network

LT corpora

Speech and text corpora and their combinations are necessary starting points for many types of LT modules and applications. The quantities required have grown substantially. Different levels of annotation are necessary for various methods and research topics. The availability of corpus material is often too restricted, excluding all commercial use and, with it, any development of LT modules. Model contracts for collections of copyright-protected corpora should be created for all countries, and these model contracts should guarantee the necessary ways to use the materials.

Key Area | Magnitude of funding needed | Parties involved | Mode of cooperation
Model contracts | 50 kEUR | research organizations, lawyers, NEALT | networking across countries
Corpus collection, written text | 10-15 MEUR per language | universities, NEALT | networking across countries
Corpus collection, spoken data | 10-20 MEUR per language | universities, NEALT | networking across countries

LT lexicons

Dictionary materials which have been developed with public funding ought to be published as open source material so that they can be used for creating LT modules such as parsers and analyzers. More specifically, lists of headwords annotated with part of speech and inflectional class should be made available under very free conditions permitting their use in both academic and commercial contexts. The full text of dictionaries published as books may be reserved for academic use, but there must not be limitations on further use of methods, rules or programs which have been developed using such material, provided that they do not contain parts infringing on the copyright of the original work.

Key Area | Magnitude of funding needed | Parties involved | Mode of cooperation
Lexicon development | 10 MEUR per language | universities, NEALT | networking across countries

LT Research and Development

The academic funding institutions ought to adopt recommendations or rules concerning linguistic resources which will be (or have been) developed using public funding. It ought to be a normal requirement that researchers make the linguistic resources available to the rest of the research community under conditions or licenses that are as free as possible. In addition, we may need to open up language resources on all levels (lexicons, grammars, written language corpora, speech corpora, etc.) which have been created through public funding. Common interfaces and tools must be created in cooperation between commercial and academic parties.

Key Area | Magnitude of funding needed | Parties involved | Mode of cooperation
Recommendations for research result materials | 50 kEUR | funding organizations, universities, NEALT | working groups
Joint effort for standardization | 15 MEUR | universities, industry, NEALT | academia/industry collaboration
Basic technology research | 15 MEUR | universities | joint programme, researcher exchange, workshop, division of research tasks
R&D Funding | 50-80 MEUR | universities, research institutes, industry | Nordic projects

The R&D funding can be further broken down into various fields of services and applications for society.

LT Training and Education

As part of the Nordic Language Technology Research Program 2000-2004, an LT documentation centre was established in each of the five Nordic countries. Some continuation for them is needed, either in conjunction with a world-wide effort such as LT World or as a Nordic or Nordic-Baltic effort. More cooperation is needed in academic training among the universities in the Nordic/Baltic region. A sufficient number of highly skilled PhDs and Masters ought to be trained with the best possible LT skills, and all countries and language groups should participate, including minorities and small language communities.

Key Area | Magnitude of funding needed | Parties involved | Mode of cooperation
Nordic LT documentation | 1 MEUR | NMR | network of LT documentation centres
NEALT Journal start-up | 50 kEUR | NEALT, Nordisk Publiceringsnämnd | scientific electronic journal
Coordinated PhD education | 1 MEUR | Nordic/Baltic universities | NGSLT
Master's level education | 2 MEUR | Nordic/Baltic universities | distance education, exchange programs for teachers and students, common curriculum
Distance learning courses for commercial developers | 50 kEUR | Nordic/Baltic universities | production of the material
Popularization | 1 MEUR | R&D, Government, Industry, Secondary Education | professional PR assignment

LT Legislation

The development of LT tools depends on the availability of language resources such as corpora. Current copyright legislation makes the collection of resources unnecessarily difficult and costly. Certain privileges are currently granted to a few national libraries for archiving electronic copies of books, journals etc., and similar privileges are needed for creating LT resources. The legislation should be changed so that collecting, annotating and sharing text and speech corpora for the purposes of research and development becomes easier. The use of such corpora should be deemed to conform to the principles of copyright as long as republication is excluded. Changing the copyright legislation would make collecting corpora more productive by guaranteeing that corpora and annotated material are available for research and development purposes. Availability can be achieved either by allowing centres (such as national language banks) to share materials with each other or by allowing individual researchers to share them.

Key Area | Magnitude of funding needed | Parties involved | Mode of cooperation
Preparation of changes in the legislation | 10 kEUR | relevant ministries, universities, NEALT | working groups

LT Business Aspects

The licensing conditions of LT resources must allow and encourage both their commercial and academic use. Medium-term applied research projects together with industrial partners should continue. Funding should be provided for creating and purchasing LT applications and services for the public sector. This funding is intended to stimulate market uptake of LT services and applications. Such services could pursue more ambitious goals through LT-enhanced applications.

Key Area | Magnitude of funding needed | Parties involved | Mode of cooperation
LT module uptake | 5 MEUR | industry and universities | action plan managed at Nordic level
Web services | 5 MEUR | industry and universities | academia/industry collaboration

Action plan

The aim of the report was to identify key areas, magnitude of funding, parties involved and modes of cooperation. However, we are still left with questions regarding further specification of the plans as well as priorities and time-frames within the 10-year period. Some answers have been sketched for the organization of the work, but more detail is needed as well as some further consideration of the division of national and Nordic funding. To implement the goals and to further specify the areas and their time-frames in the 10-year plan, we suggest the following steps in allocating resources:

  1. Establishing NEALT and its working groups
  2. Commissioning BLARK reports for the Nordic languages
  3. Nordic funding for cooperation on LT training and education
  4. National funding of medium-term applied research projects involving university and industrial partners

When the BLARK reports have been delivered, resources coordinated by NEALT should be allocated for

  1. Nordic funding of LT tools according to the recommendations of the BLARK reports
  2. Nordic and national funding of corpora, treebanks and lexicons based on the BLARK report recommendations

