Current situation in 2006

Copyright and other IPR legislation has been an obstacle to collecting research materials and sharing them for academic purposes. Schemes and model contracts exist for collecting text and speech corpora, but they are laborious to use and often restrict how the materials may be used. Some recent changes in copyright legislation have made it even more difficult to collect and digitize material, because they overlook research and development uses.

Patenting of computer programs and algorithms has become harmful for LT. Publishing research results early and applying open source policies help in part but do not fully solve the problem. Much careful study and new research is needed because some patents protect the most obvious ways to solve common problems. Resolving software patent conflicts is beyond the financial resources of researchers and of small and medium-sized enterprises, even when the patent is obviously invalid.


  • Current copyright law and IPRs are an obstacle to the creation of quality resources.
  • LT modules require complicated and costly licensing.
  • The tools for creating LT modules are difficult and costly to acquire.
  • Many development efforts are at a standstill, as others will not or cannot develop proprietary resources or products owned by a competitor.

Vision for 2016

In 2016, legislation and infrastructure are in place under which text and speech corpora can be freely collected, annotated and used for the purposes of research and development. The arrangements make it possible for any published source to be stored and processed for the purpose of creating research results and LT products without compromising the copyright of the source. In addition, patenting obvious ways of solving problems with programs is no longer possible, and such patents have been declared invalid.


The survival of cultures and languages with a relatively small number of speakers depends on the ability to use the language in daily life. This depends more and more on the availability of LT. The development of LT tools in turn depends on the availability of language resources such as corpora. Copyright legislation should enable the collecting, annotating and sharing of resources for research purposes. Currently, certain privileges are granted to a few national libraries to archive electronic copies of books, journals etc., and similar privileges are needed for developing LT resources. For example, the Finnish library for the blind has a privilege to make electronic copies of copyrighted materials for the purposes of that library. In a similar vein, it is recommended that the legislation be changed so that the collection of text and speech corpora for the purposes of research and production of LT tools is possible. The use of such corpus collections would be deemed to conform to the principles of copyright as long as no longer passages are republished. Changing the copyright legislation would make collecting corpora more productive by guaranteeing that corpora and annotated material are available for research and development purposes. Availability can be achieved either by allowing centres (such as national language banks) to share materials with each other or by allowing individual researchers to share them.

Key Area: Preparation of changes in the legislation
Magnitude of funding needed: 10 kEUR
Parties involved / Mode of cooperation: Relevant Ministries, Universities, NEALT working groups

