PWP 3: Building the prototype (an initial operational CLARIN with data and tools)

PWP 3a: Language Resources

Draft: Erhard Hinrichs

One of the aims of the preparatory phase of the project should be to develop a prototype of the CLARIN infrastructure.

  • a running prototype serves as a proof of concept on which all activities are grounded. It is therefore important to have it as soon as possible, since it helps us to identify the major risks and impacts of the project at an early stage;
  • a prototype which is realized for a small subset of the languages allows us to better estimate the costs and efforts of realizing the complete infrastructure;
  • a running prototype which is as close as possible to the final infrastructure will help us to validate that infrastructure through evaluations which involve representatives of the prospective users of the resources.

This workpackage is therefore closely related to the specification of the infrastructure in workpackage 2. It is also related to workpackage 4, community networking, as the prototype has to take into account typical user scenarios and use cases. Surveys of the primary target groups will influence the specification. The functionalities which are needed for these use cases and target users will be given priority in the design and implementation of the prototype.

In the following, we will describe the integration of language resources into the prototype.

Listing of available language resources within the EU

For all of the languages involved in the project the resources available should be mapped to the CLARIN language resource type scheme. The application of this scheme to individual languages will reveal gaps in the "language resource map" of these languages.

The CLARIN consortium sees it as one of its missions to raise awareness of these gaps in the language community and to initiate efforts to close them.
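The gap analysis described above can be illustrated with a minimal sketch. The resource type names, the language labels and the function `find_gaps` are hypothetical placeholders, since the actual CLARIN language resource type scheme is not defined in this section.

```python
# Hypothetical resource types; the real CLARIN language resource
# type scheme is not specified in this section.
REQUIRED_TYPES = {"monolingual corpus", "lexicon", "treebank", "speech corpus"}


def find_gaps(available):
    """Return, per language, the sorted scheme types for which no resource is listed."""
    return {
        language: sorted(REQUIRED_TYPES - set(types))
        for language, types in available.items()
    }


# Illustrative inventory: language -> resource types found for it.
inventory = {
    "Language A": ["monolingual corpus", "lexicon", "treebank", "speech corpus"],
    "Language B": ["lexicon"],
}

gaps = find_gaps(inventory)
for language, missing in gaps.items():
    print(language, "->", missing or "no gaps")
```

In this sketch a gap is simply a type of the scheme for which a language has no listed resource; a real survey would probably also have to take size and quality into account.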

Besides the core resources for each language, which will be catalogued and made available by the CLARIN project through a common portal, we expect to find for each language various resources which do not fit into the scheme or are too small or too specialized to be of interest to a larger community. Some users will nevertheless benefit from finding these resources, since this may spare them the cost and effort of developing their own. The project will therefore establish a language repository, which will extend the core CLARIN language resource type scheme and list at least metadata and documentation for these resources.

We consider it useful to include in the prototype one representative of each of the following language families: Slavic, Romance, Germanic and Semitic. This should help to detect difficulties which are due to the linguistic peculiarities of these language families.

Integration of language resources into the CLARIN prototype

The prototype implementation will be based on a technical architecture with distributed resources with central access through a portal and repository. This architecture is described in more detail in section XXX of this proposal. Concerning the language resources, the implementation of the prototype requires the following activities on the conceptual layer:

  • Locating the language resource; clearing intellectual property rights and access restrictions;
  • Describing the structural properties and the data categories used for the resource;
  • Retrieving and describing metadata for the resource; providing these metadata in a standardized format. The metadata will be stored on site and in a central repository;
  • Preparing the language resources of the chosen languages to reach an agreed-upon level of uniformity of structure and annotation, as far as the linguistic details of the languages involved allow this. In addition to converting language resources to the CLARIN format specification, care should be taken that the original format of the resources is preserved.
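As an illustration of the metadata activity listed above, the following minimal sketch describes one resource in a uniform record and stores the same serialized record both on site and in a central repository. All field names, the identifier and the two in-memory stores are assumptions made for this example; the standardized metadata format itself is not specified in this section.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ResourceMetadata:
    # All field names are illustrative assumptions, not a CLARIN format.
    identifier: str
    title: str
    language: str
    resource_type: str
    original_format: str  # recorded so that the original format stays documented
    access_restrictions: str = "unspecified"
    data_categories: list = field(default_factory=list)


def to_json(record):
    """Serialize a record to JSON with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True, ensure_ascii=False)


record = ResourceMetadata(
    identifier="example-0001",
    title="Example annotated corpus",
    language="Language A",
    resource_type="monolingual corpus",
    original_format="plain text",
    data_categories=["part of speech", "lemma"],
)

serialized = to_json(record)

# Stand-ins for the on-site store and the central repository.
local_store = {record.identifier: serialized}
central_repository = {record.identifier: serialized}
```

Keeping one serialized form for both stores is only one possible design; it makes it easy to check that the on-site and central copies of a record agree.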

On top of the conceptual layer, the presentation of the resource will be designed to conform with a common design for CLARIN resources. This includes the analysis of the current resource presentation and its adaptation on the following levels:

  • Languages of the presentation pages: there should be at least one access page in the language of the resource(s) and one equivalent page in English;
  • Listing of contents and services on the presentation page;
  • Navigational structure; this structure should be designed according to standards and conventions of website usability;
  • Metadata structure and retrievability;
  • Types / formats of search results;
  • Export of retrieval results and integration into other applications;
  • Support of users and contact data.

This will be a step towards a Corporate Identity of the CLARIN project and will facilitate the use of language resources from various languages and sites.

A further important issue in building the prototype is to determine the methods and tools for accessing language resources and extracting data from them. Several aspects have to be taken into account:

  • access, retrieval and extraction tools are different for each resource type. Access to lexical resources is more conventionalized than access to (annotated) corpora. For lexical resources it is therefore easier to supply a base functionality on which the user community will agree;
  • the selection and use of these tools depends on the functionalities which are required for our reference usage scenarios;
  • the available tools depend on the structure of the resources for which they are implemented. The standardization of resource encoding formats therefore affects the access tools, which have to be adapted if necessary.

The ideal situation with regard to the harmonization of the infrastructure would be to have one mode of access and one query language for each type of resource. We assume that this aim will not be achievable, at least in the preparatory phase of the project. We will therefore first try to a) survey and assess the existing access and query tools; b) harmonize the documentation of these tools; and c) flatten the learning curve for each of these tools by demonstrating their formats and use for a set of prototypical use cases and queries.

Tasks and deliverables

| Task | Deliverable |
| Providing access methods for distributed LRs | Implementation |
| Definition of the LR-subsets and samples to be included in the prototype. Main Distinction: Languages and LR-Types | List of LRs |
| Definition and Implementation of conversions from original formats into CLRT | Implementations |
| Prototypical Conversion of a large number of LRs from one selected language | Implementation |
| Definition and description of the conversion procedure | Technical Documentation |
| Managing and supervising the resource conversions of the other Languages |  |
| Analysis of websites which allow access to language resources | Report |
| Guidelines for the design of CLARIN-conformant access websites | Report |

PWP 3b: Language Technologies

Dan Tufis

The development of language technologies will focus on serving the basic multilingual, multicultural and multidisciplinary functionality of the CLARIN infrastructure.

One example of such a technology would be a system which automatically classifies language resources according to the CLARIN language resource type scheme and which is able to detect annotation gaps and thus to identify the processing tools required to complete the encoding.

As a general approach, the NLP tools will be implemented, documented and registered as web services, using grid technologies and good practices.

BLARK and ELARK (as well as the required LRs for each of the basic or extended types of processing) will be implemented following a “lego-like” philosophy with user-configurable processing chains of linguistic services. While language web services should be available at the smallest possible granularity (tokenization, lemmatization, POS-tagging, chunking, named entity recognition, parsing, generation, acoustic modeling, language modeling, translation modeling, word sense disambiguation, etc.), the CLARIN LT infrastructure could offer some predefined processing chains (e.g., tokenization+POS tagging+lemmatization+chunking) or even complete applications built from the existing building blocks (e.g., spelling/grammar checkers, automatic diacritics restoration, document classification systems, semantic-based information retrieval systems, dictation systems, speaker identification, reading aloud systems, question-answering systems, machine translation systems).
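The “lego-like” composition of services can be sketched as follows. This is a minimal local illustration under stated assumptions: each function stands in for a web service that would be called remotely in the real infrastructure, the function names are hypothetical, and the toy tokenizer and lemmatizer are placeholders rather than real linguistic tools.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Service = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    """Toy tokenizer: splits the text on whitespace."""
    return {**doc, "tokens": doc["text"].split()}


def lowercase_lemmatize(doc: Document) -> Document:
    """Toy stand-in for a lemmatizer: lowercases each token."""
    return {**doc, "lemmas": [token.lower() for token in doc["tokens"]]}


def build_chain(services: List[Service]) -> Service:
    """Combine fine-grained services into one user-configured processing chain."""
    def run(doc: Document) -> Document:
        for service in services:
            doc = service(doc)
        return doc
    return run


# A predefined chain assembled from the two building blocks.
chain = build_chain([tokenize, lowercase_lemmatize])
result = chain({"text": "Language Resources and Tools"})
print(result["lemmas"])  # prints ['language', 'resources', 'and', 'tools']
```

Because every service takes a document and returns an enriched document, users can reorder or exchange the blocks without changing the chain mechanism, provided that each block finds the annotation layers it depends on.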

Besides basic and integrated language services, the CLARIN LR infrastructure should include tools for LR evaluation and validation as well as for quality assurance of the provided services. These instruments are usually tailored to specific types of resources (lexicons, corpora, parallel corpora, grammars, ontologies, etc.) and they should be aligned with the CLARIN language resource type scheme.

The LT platform should also include specialized annotation tools which are aware of the CLARIN-adopted standards and good practices and which offer friendly interfaces that people in the humanities can understand and use with minimal (if any) training.

The CLARIN LT infrastructure should be based (as much as possible) on fast language-independent technologies (e.g. finite state technology, statistical modeling and processing) and rely mostly on machine learning techniques and data-driven approaches, aiming at applicability to all the languages concerned, without (or with minimal) language-specific code adaptation.

This work will be carried out in close cooperation with the work in WP2, WP3a, WP4 and WP5.



* (draft of 3a) wp3.pdf

-- LotharLemnitzer - 08 Feb 2007

Topic revision: r14 - 2008-11-07 - HennaRiikkaLaitinen
 