PWP 2b: Standards for Language Resources

Draft outline

Erhard Hinrichs, Nicoletta Calzolari, Susanne Alt


A crucial strength of the CLARIN Initiative lies in the numerous Language Resources (LRs) that have been developed for many of the official languages used in the EU. These resources include large text archives, large linguistically annotated corpora, treebanks, typological databases, lexical resources, terminological resources, ontologies and grammars.

The richness and diversity of these resources make them very attractive for a wide range of disciplines in the Humanities and related branches of science.

One of the limiting factors that have thus far impeded the optimal use of these resources is the lack of commonly agreed-upon encoding standards and encoding practices. This is particularly true for resources whose development started prior to, or independently of, the emergence of standardized schemata for structural and linguistic encoding.

Within the last twenty years, several initiatives have developed and distributed standards for Language Resources. The main rationale behind these efforts is the stakeholders' experience that producing and maintaining high-quality LRs is an extremely laborious, time-consuming and, hence, expensive process. Reusing these resources in various NLP environments is therefore indispensable, and the sustainability of LRs is increasingly regarded as an important economic factor.

The implementation of standards helps to achieve this goal, since standards are generally well documented, used by many LR-related projects and initiatives, and supported by different software tools. However, the variety of standards and initiatives shows that the choice of standards remains highly controversial and that the community has not yet reached a consensus.

Moreover, new standards are still emerging and existing ones are still being revised, while language resources are being built in parallel in many places all over Europe. Given the state of the art, the following three goals for the preparatory phase of the CLARIN Initiative appear to be the most crucial steps toward the long-term and ambitious goal of providing a unified Language Resource Infrastructure for all official languages in the European Union.

Survey of available LRs and description of their legal status

As a prerequisite for making as many LRs as possible available to the communities, a comprehensive resource survey needs to be conducted. Here we build on previous survey initiatives, such as ENABLER and the ELRA Universal Catalogue, to mention just two. The results of these surveys will be taken into account, and many of their initiators are in the CLARIN consortium. Our survey needs to go beyond a mere enumeration of the already available metadata for each resource, since in many cases this information is incomplete and lacks detailed accounts of the annotation schemes and encoding standards used in the resource. Given the long-term CLARIN goal of making as many LRs as possible available to the research community at large, detailed accounts of the intellectual property rights underlying each resource need to be collected. WP 7 is dedicated to the handling of legal issues; we will therefore establish a close connection between the two work packages at that point.

Survey of existing and evolving encoding standards

Given the various standardization efforts that are already underway, in particular within the standardization bodies ISO and W3C, as well as within many national standardization bodies and several scientific communities, the CLARIN Initiative does not aim at developing additional standards. Instead, CLARIN intends (i) to enable and facilitate the application of existing standards whenever possible, and (ii) to liaise with standards-defining bodies and committees, in which several of the key researchers involved in the CLARIN Initiative already serve as members or coordinators. A detailed list of past and present standardization efforts is given below. It needs to be emphasized that many of these activities are still ongoing and will require continuous monitoring and interaction by the CLARIN research team. In addition, the results of these efforts need to be disseminated to all CLARIN resource developers and providers, so that they can promote them in their communities, apply them to their resources, and test them.

Specification of the LR Type Scheme based on a classification of resources

A central mission of the project is to classify language resource types so that a core set of language resources for each language can be defined. In order to obtain a more systematic overview of the resources for a wide range of languages, we recommend classifying these resources according to a language resource scheme. As the backbone of such a classification, a taxonomy or ontology of language resource types will be developed. We propose to fix at least the upper level of this taxonomy as six major types of resources, i.e. archives, corpora, treebanks, typological databases, lexical resources and grammars. The use of a seventh category for those resources which do not fit into any of these six categories should also be considered. Of course, this is an issue which must be discussed and agreed upon by the CLARIN partnership in the preparatory phase. Once a taxonomy exists and is agreed upon by the partners, it will guide subsequent tasks. The language resource types which are part of this taxonomy should be identified for each individual language. This will provide us with a systematic scheme of existing resources for each of the languages of the CLARIN project and help us to identify gaps. For some of the resources involved, it is desirable not only to document them in their heterogeneity, but also to try to harmonize them. A uniform annotation would allow both human and machine users to access the resources more easily.

In defining a CLARIN Language Resource Scheme, the CLARIN project builds on the experience and achievements of former initiatives, e.g. the definition of a Basic Language Resource Kit (BLARK) in the ELSNET project, and the survey and stocktaking of Arabic language resources in the NEMLAR project.
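
The classification described above can be illustrated with a minimal sketch. It assumes the six proposed upper-level types plus the optional seventh category, and it treats a "gap" simply as a core type for which a language has no surveyed resource. The names ResourceType, CORE_TYPES and find_gaps, as well as the language codes and survey data, are hypothetical and serve illustration only; the actual scheme remains to be agreed upon by the partners.

```python
from enum import Enum


class ResourceType(Enum):
    """Upper level of the proposed taxonomy (six types plus a catch-all)."""
    ARCHIVE = "archive"
    CORPUS = "corpus"
    TREEBANK = "treebank"
    TYPOLOGICAL_DATABASE = "typological database"
    LEXICAL_RESOURCE = "lexical resource"
    GRAMMAR = "grammar"
    OTHER = "other"  # possible seventh category, still to be agreed


# The six named types; OTHER is excluded because it cannot be "missing".
CORE_TYPES = frozenset(ResourceType) - {ResourceType.OTHER}


def find_gaps(survey):
    """Map each language to the core resource types it has no entry for."""
    return {
        language: sorted(CORE_TYPES - set(types), key=lambda t: t.value)
        for language, types in survey.items()
    }


# Hypothetical survey results, for illustration only.
survey = {
    "lang-a": {ResourceType.CORPUS, ResourceType.TREEBANK, ResourceType.GRAMMAR},
    "lang-b": set(ResourceType),
}

gaps = find_gaps(survey)
for language, missing in gaps.items():
    print(language, [t.value for t in missing])
```

A flat enumeration is used here only because the text fixes the upper level of the taxonomy; a real scheme would presumably need subtypes and a link to the metadata of each resource.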

Former Standardization Initiatives

  • The EU project MULTEXT / MULTEXT EAST established standards for the annotation of corpora.
  • The ELSNET project defined a Basic Language Resource Kit (BLARK).
  • The European Expert Advisory Group on Language Engineering Standards (EAGLES) developed standards and recommendations in various areas of language engineering.
  • PAROLE's aim was "to offer a large-scale harmonised set of 'core' corpora and lexica for all European Union languages."
  • The project ISLE (International Standards for Language Engineering) continued the standardization work of the EAGLES project. An additional aim of ISLE was to establish links to similar initiatives in the US and in Asian countries.
  • The project LIRICS (Linguistic Infrastructure for Interoperable Resources and Systems) supported the LR-related standardization activities of ISO TC37.

Current Standards

  • ISO Standards
    • ISO 24611 Language Resource Management -- Morpho-syntactic Annotation Framework (MAF)
    • ISO 12620 on Data Category Registry (DCR)
    • ISO 24610-1 on Feature Structure Representation (FSR)
    • ISO 24610-2 on Feature System Declaration (FSD)
    • ISO 24612 on Linguistic Annotation Framework (LAF)
    • ISO 24613 Lexical Markup Framework (LMF)

  • Others
    • Dublin Core (DC)
    • OLAC (Open Language Archives Community) metadata standard, which builds on DC
    • IMDI
    • INTERA Project
    • Text Encoding Initiative
    • ITS (W3C)
    • CES/XCES (corpora)
    • Mate/Nite (multimodal resources)

These standards are currently at different levels of development and of acceptance by the communities. One of the tasks of the preparatory phase is to keep track of their status.

Tasks and deliverables

Task | Deliverable(s)
Survey of existing and evolving encoding standards | Report
Survey of available LRs | Report
Description of the legal status of all Language Resources | Digital Information Service (web-based)
Definition and Description of a Meta-Data Scheme | Report
Development of a Classification Scheme for Language Resources | Report
Specification of the CLARIN LR Type Scheme (CLRTS) according to the classification | -
Formal Definition of the CLRTS for all Language Resource Types, e.g. XSchemata | Guidelines & Report
Documentation of the CLRTS | Report
Rules of Best Practice | Report

The LR-related tasks of the WP "Infrastructure for Language Resources" are concerned with organising the collection of the LRs. The partners should be enabled to give access to their LRs in a way that allows a distributed LR repository to be built up according to general guidelines (WP 3).

