CLARIN Technical Infrastructure

Peter Wittenburg - 3.2.2008

CLARIN is devoted to establishing an integrated and interoperable research infrastructure for the language resources and technology (LRT) domain. The goal is to make language resources and technology much more accessible to all researchers working with language material, in particular in the humanities and social sciences. Building such an eScience-enabling infrastructure requires investments at various layers - an important one is to establish its technical infrastructure. This needs to help overcome the large fragmentation that researchers, in particular in the humanities and social sciences, suffer from when they want to work with language resources. Little is ready to be used by researchers other than its creator, and even less fits together, so that non-expert users generally cannot easily combine resources and tools from different projects to tackle new research questions.

A suitable technical infrastructure will allow users to deposit and register their LRT components, thereby making them visible and accessible to others; to search for all types of accessible resources and tools; to create their own virtual collections of resources from different repositories to work on; to apply language technology tools to solve their specific problems; and to combine existing language technology into new, more complex operations that include components from different developers.

Roughly, we can distinguish two layers to work on when building such a technical infrastructure: (1) the integration layer, which helps overcome institutional boundaries, and (2) the interoperability layer, which helps overcome the problems created by different encodings, structures and vocabularies. CLARIN will set up a network of repositories and service centers where users can deposit and register their resources and that will help turn language technology into usable services. The joint registry domain will allow users to look for LRT components that are relevant for their research. It will also help build virtual collections and store their context so that they can be re-used. Federation middleware will guarantee that users can use their home identity to access a collection, provided that they have been granted access permissions. A single login should be sufficient to access a virtual collection even though its individual resources will probably originate from various repositories. Since users will create a virtual domain full of links for different purposes, we need to associate unique and persistent identifiers (PIDs) with all resources. These PIDs will also help to distinguish between objects and their various instances at various locations, which may have been created for preservation or load optimization purposes, for example.
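The role of PIDs described above can be illustrated with a minimal Python sketch. It is an assumption-laden toy, not a CLARIN component: the class names `PidRecord` and `PidRegistry`, the `pid:example/...` identifier scheme and the `example.org` URLs are all hypothetical. It only shows the idea that a virtual collection stores PIDs, while a registry maps each PID to one or more instance locations.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PidRecord:
    """One logical object and the locations of its stored instances."""
    pid: str
    locations: list[str] = field(default_factory=list)


class PidRegistry:
    """Toy in-memory registry (hypothetical) mapping PIDs to locations."""

    def __init__(self) -> None:
        self._records: dict[str, PidRecord] = {}

    def register(self, pid: str, location: str) -> None:
        # The same PID may point to several copies, e.g. a mirror kept
        # for preservation or load optimization.
        record = self._records.setdefault(pid, PidRecord(pid))
        if location not in record.locations:
            record.locations.append(location)

    def resolve(self, pid: str) -> list[str]:
        # Raises KeyError for an unknown PID.
        return list(self._records[pid].locations)


registry = PidRegistry()
registry.register("pid:example/0001", "https://repo-a.example.org/corpus.xml")
registry.register("pid:example/0001", "https://repo-b.example.org/mirror/corpus.xml")
registry.register("pid:example/0002", "https://repo-c.example.org/lexicon.xml")

# A virtual collection holds PIDs only, never physical locations, so it
# stays valid when instances are moved or replicated.
virtual_collection = ["pid:example/0001", "pid:example/0002"]
resolved = {pid: registry.resolve(pid) for pid in virtual_collection}
```

Because the collection refers to objects only through PIDs, adding or moving an instance changes the registry entry but not the collection itself.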

If a user needs to access and combine the various components, CLARIN needs to offer services that help to overcome the interoperability problems. These services will range from conversion services that transform source formats into more generic standardized formats, overcoming structural differences, and terminology services that help to overcome semantic differences, to workflow services that allow users to combine tools such as taggers and parsers into more powerful workflows. Standards need to be applied and new standards will be required to achieve the intended high degree of interoperability. CLARIN can rely on previous standardization efforts by initiatives such as EAGLES/ISLE, W3C, TEI and ISO TC37/SC4 and on years of experience in some areas.
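How conversion, tagging and terminology services might be chained into a workflow can be sketched in a few lines of Python. Everything here is a hypothetical stand-in chosen for illustration: the function names, the two-word lexicon and the `TAG_MAP` vocabulary mapping are assumptions, and real services would be remote components rather than local functions.

```python
from typing import Any, Callable


def compose(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Chain steps so that the output of each one feeds the next."""
    def run(data: Any) -> Any:
        for step in steps:
            data = step(data)
        return data
    return run


def convert_to_tokens(text: str) -> list[str]:
    # Stand-in for a conversion service: source format -> generic format.
    return text.split()


def toy_tagger(tokens: list[str]) -> list[tuple[str, str]]:
    # Stand-in for a tagger that uses its own project-specific tag set.
    lexicon = {"dogs": "N", "bark": "V"}  # hypothetical toy lexicon
    return [(tok, lexicon.get(tok.lower(), "X")) for tok in tokens]


# Hypothetical mapping from project-specific tags to a shared vocabulary,
# standing in for a terminology service.
TAG_MAP = {"N": "noun", "V": "verb"}


def harmonize_tags(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    return [(tok, TAG_MAP.get(tag, "unknown")) for tok, tag in tagged]


workflow = compose(convert_to_tokens, toy_tagger, harmonize_tags)
result = workflow("Dogs bark loudly")
```

The steps can only be composed because each one agrees with its neighbours on the intermediate format, which is the point the text makes about needing standards.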

To achieve a scalable and distributed infrastructure that also takes care of discipline-specific requirements, we will rely on a service-oriented architecture, a flexible registry system and a network of strong centers. Distributed architectures are vulnerable, and thus all services need to be based on trusted certificates that will be issued in close collaboration with the national network and grid authorities. The centers need to guarantee the persistence and availability of the services so that researchers can be convinced to rely on them. In addition, centers need to organize their services in such a way that no bureaucratic hurdles are established. Such a technical infrastructure will only work once an organizational framework has been established and the issuing of access rights has been simplified. These aspects will be discussed elsewhere.

Topic revision: r3 - 2008-02-06 - DieterVanUytvanck