Data Management Guidelines: FIN-CLARIN – Common Language Technology and Resource Infrastructure (CLARIN) in Finland

Data management plan (DMP)

1. FIN-CLARIN overview and division of responsibilities

FIN-CLARIN is a distributed infrastructure. The FIN-CLARIN partners handle production of language technology and resources relatively independently. As an infrastructure, FIN-CLARIN has three primary strategies to promote good data management: training, support, and infrastructure design. FIN-CLARIN as a consortium gives guidance and requirements for publishing datasets and open access publishing.

The data and technology produced by FIN-CLARIN can be divided into two parts:

  1. Data sets and technologies developed by scientific research projects of the FIN-CLARIN partners. These resources are owned and handled by the FIN-CLARIN partners. The partners are responsible for their own storage and open access according to their guidelines and policies. The resources are language-oriented, and are highly valuable from a scientific point of view.
  2. FIN-CLARIN internal data on infrastructure usage, including technical information and statistics. This data is relatively small in scope, and is handled by CSC – IT Center for Science Ltd. following the processes and requirements set by CLARIN ERIC and EGI (the European Grid Infrastructure). This quantitative data is valuable for further development of FIN-CLARIN.

The researchers and data owners using FIN-CLARIN have ultimate responsibility for managing type 1 data. However, FIN-CLARIN shares technical expertise and good scientific practices. The DMP allows FIN-CLARIN to reach a significant number of users within Finland. To this end, FIN-CLARIN makes information on data management a key part of its activities.

To facilitate data management, FIN-CLARIN requires partners to follow open access policies. It provides this DMP, which sets out general principles within the infrastructure, and detailed Data Management guidelines that take into account the specific policies and environments of each partner and give partner-specific and discipline-specific advice.

2. Management of FIN-CLARIN data developed by partner research projects

In this section, we describe our support for managing the tools and data from research projects using the FIN-CLARIN infrastructure.

2.1. Existing data management policies and activities of partners

The FIN-CLARIN partners already have individual data management practices. Through the collaboration in FIN-CLARIN, the lessons learned from these practices can be shared among institutions, in the same way that FIN-CLARIN supports other knowledge transfer.

As FIN-CLARIN partners already have their own data management support and policies, these will be adhered to. They are published on the web and listed in Table 1. All of them take into account data management throughout the data lifecycle, from planning to archiving and reuse.

Table 1. Data management policies and guidelines of FIN-CLARIN partners

Furthermore, each partner will name a data management contact person, who is in charge of enforcing this plan at his or her location. They will provide partner-specific user support and training.

2.2. FIN-CLARIN Data management principles and guidelines

FIN-CLARIN is an infrastructure, and thus day-to-day data management must ultimately be done by end-users. In order to ensure that users follow the DMP, we recommend initial training in data management. For projects applying for restricted resources, we require an initial data management plan before access is granted. This ensures that users consider data management as part of their research.

Primarily, we recommend that users follow the existing partner data management guidelines with respect to openness and dissemination. When possible, we extend and improve these guidelines with focus on scientific research and seek to resolve any potential conflicts between policies.

The consortium expects compliance with the data management guidelines, and all partner sites provide support to their users of the infrastructure. We continually update the guidelines with current best practices and the latest recommended services.

FIN-CLARIN as an infrastructure covers the middle of the data lifecycle, i.e. the actual storage and computation. However, the FIN-CLARIN support services cover all stages of the data lifecycle. FIN-CLARIN is committed to open science principles and open publishing. The data guidelines further explain the recommendations for implementing these principles and how to leverage existing services.

FIN-CLARIN provides a landing page on data management with information on data management specific to language technology and resources. It contains both new information and links to CLARIN ERIC as well as national and partner-specific information, including local contacts. This information is available for central use as well as for partner-specific documentation.

The consortium publishes the data management plan as well as practical data guidelines and collates advice on data management practices on its wiki pages:

The FIN-CLARIN wiki pages also provide a forum for sharing best practices by the data management contact persons to support data management locally.

FIN-CLARIN recommends open science principles and open access publishing. For implementing this, FIN-CLARIN recommends using suitable existing services such as DMPTuuli for project data management planning, the upcoming national digital preservation service portfolio for research, as well as international services such as Zenodo and EUDAT. See Table 2.

Table 2. Recommended and supported data management solutions

All relevant publications must be reported according to each organization’s guidelines, ensuring they are sent to the national VIRTA publication reporting system. In general, this is done through the university reporting systems, which are also used for our internal reporting and acknowledgement of the infrastructure. All publications produced using the FIN-CLARIN infrastructure should include a reference to the technologies and resources provided by the infrastructure. The references to be used are the persistent identifiers given by the infrastructure through the reference service, e.g. https://www.kielipankki.fi/viittaus/key=RESOURCE&lang=en.

Research data that is prepared for sharing has to be stored either in the IDA service, or in a similar organizational/national/international archive. When choosing a storage and sharing service, the user must consider legal and ethical issues and ensure that the stability and availability of the chosen service are suitable for long-term storage. The service must also give a persistent identifier for the datasets so that there is a way to refer to the resource. We recommend that all datasets are stored in a service from which they can be shared and that they are licensed so that others can use them (e.g. Creative Commons licenses).

The choice of license must take into account the legal and ethical aspects of the data. Software and databases have their own license recommendations. For open datasets, we recommend Creative Commons BY 4.0 and for open metadata CC0. The former requires attribution to the original creator; the latter waives all rights, ensuring maximum visibility for metadata.

All tools and datasets must be described in https://metashare.csc.fi/ or https://vlo.clarin.eu. If a resource is described in another service, there must be a reference to either of these descriptions (for instance with a persistent identifier). The description of a dataset has to include administrative, technical and descriptive metadata according to current standards. The goal is to ensure good discoverability of the resource and adequate levels of information for others to evaluate the possibility for reuse. In addition to the metadata description, a link to the resource and possible license information has to be included. All resources produced using FIN-CLARIN should include a reference to the infrastructure service.
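The description requirements above can be checked mechanically before a resource is registered. The following is a minimal sketch in Python, not a FIN-CLARIN tool: the field names in `REQUIRED_FIELDS`, the grouping into categories, and the example record are all hypothetical and would need to be mapped to whichever metadata schema is actually in use.

```python
from typing import Dict, List

# Hypothetical field names, grouped by the metadata categories named in the
# guidelines. A real schema will use different names.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "administrative": ["owner", "license"],
    "technical": ["format", "size_bytes"],
    "descriptive": ["title", "description", "language"],
    "access": ["resource_url", "infrastructure_reference"],
}


def missing_fields(record: Dict[str, object]) -> Dict[str, List[str]]:
    """Return, per category, the required fields that are absent or empty."""
    gaps: Dict[str, List[str]] = {}
    for category, fields in REQUIRED_FIELDS.items():
        # A field counts as missing if it is not present, None, or "".
        absent = [name for name in fields if record.get(name) in (None, "")]
        if absent:
            gaps[category] = absent
    return gaps


# Illustrative record with invented values.
example_record = {
    "title": "Example corpus",
    "description": "A small illustrative corpus",
    "language": "fi",
    "license": "CC BY 4.0",
    "format": "VRT",
    "resource_url": "https://example.org/corpus",
}

print(missing_fields(example_record))
```

For the example record, the check reports the missing owner, size, and infrastructure reference, which are the gaps a data management contact person would ask the depositor to fill.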

2.3. FIN-CLARIN-provided data management tools

FIN-CLARIN provides a variety of tools for integration with the data life cycle.

Each partner provides core day-to-day data storage for research activities. In general, this storage space is large and fast, but not backed up. A smaller home directory space is provided for backed up code and critical configurations. For large data storage, the partner storage locations must be used for back-up unless a separate agreement is made with CSC. Each partner offers integration with its local resources. Data can be stored either in individual user folders or in group folders for collaboration. In all cases, data is protected by file system permissions.

FIN-CLARIN has installed the iRODS commands in its computing environment, which allow direct access to the IDA storage service. This allows direct staging to and from long-term storage. The ePouta cloud service provides several tiers of data storage: default, non-backed up disks, high-performance IO storage, and normal backed up storage.

2.4. Implementing data management training and instruction

By far, the hardest data management problems are on the end-user side, where we assist through our consortium training processes. Support is provided both nationally and locally, with the consortium serving as a conduit for best practices to be shared. Local support staff are able to provide the most useful support. The goal is to spread local best practices nationally and to promote CLARIN ERIC standards.

As part of its data management activities, FIN-CLARIN organizes data management planning events and roadshows at all partner sites, providing targeted training and support so that the partners can better support their researchers. The consortium makes use of the Open science training materials as well as CSC’s existing training framework, including training on data science and data management.

3. Management of FIN-CLARIN internal data

In this section, we outline the data produced by FIN-CLARIN in the daily operation of its services.

3.1. Types of data

FIN-CLARIN internal data consists primarily of status, usage, and job statistics. This data is mainly useful for reporting and for the development of FIN-CLARIN services. Data is collected automatically by FIN-CLARIN services as a normal part of the usage of the FIN-CLARIN platforms and services. For example, the search tools contain records of all executed searches. This provides data automatically in a structured and interoperable form. All software and automated configurations are considered data.

3.2. Documentation and quality

Since the FIN-CLARIN centralized service setup is automatic, the software stack collecting the data is known and reproducible. Because all data comes from standard open source systems, documentation and structuring are automatic. We will prefer the standard forms from these systems when releasing the data, and defer most documentation to the authoritative upstream sources by linking. Data quality matches that of the CLARIN infrastructure: the data documents the actual performance of the systems.

3.3. Storage and backup

FIN-CLARIN operational data is backed up as a part of normal operations of systems of this scale. The total size of the data is small relative to the capacity of the FIN-CLARIN systems.

3.4. Ethics and legal compliance

The relevant operational data is non-personal and FIN-CLARIN can release it independently. Usage data may be released only in a sufficiently anonymous and aggregated form. Partners will conduct anonymization in accordance with guidelines produced by FIN-CLARIN.
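One common way to release usage data in an aggregated form is to publish only per-service counts and to suppress groups that are too small to hide an individual. The sketch below illustrates that idea in Python. It is an assumption about how aggregation could be done, not the procedure defined in the FIN-CLARIN guidelines: the event format, the function name, and the threshold of five users are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed minimum group size; the plan itself does not specify a threshold.
MIN_GROUP_SIZE = 5


def aggregate_usage(
    events: Iterable[Tuple[str, str]],
    min_group_size: int = MIN_GROUP_SIZE,
) -> Dict[str, int]:
    """Count distinct users per service and drop services with too few users.

    Each event is a (user_id, service) pair. User identifiers are used only
    for counting and never appear in the returned summary.
    """
    users_per_service: Dict[str, Set[str]] = {}
    for user_id, service in events:
        users_per_service.setdefault(service, set()).add(user_id)
    return {
        service: len(users)
        for service, users in users_per_service.items()
        if len(users) >= min_group_size
    }


# Invented events: six users of a search service, two of a download service.
events = [(f"user{i}", "search") for i in range(6)]
events += [("user0", "download"), ("user1", "download")]

print(aggregate_usage(events))
```

With the default threshold, only the search service is reported, because the download service has fewer than five distinct users. Whether such suppression is sufficient depends on the data and remains a matter for the partner conducting the anonymization.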

3.5. Data sharing and long-term preservation

The infrastructure data is reported as part of the annual reporting. Summaries are also included in FIN-CLARIN and partner reports.

Software and other code is made available from the CSC organization account under appropriate free software and open source licenses (e.g. MIT, GPLv3+).

A copy of the deposited item is placed in the backed-up long-term preservation system of the repository. The item is read from the storage from time to time to ensure that the deposited item is still accessible and readable with existing software. In case of difficulties, a recovery procedure is invoked.
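A periodic readability check of this kind is often implemented as a fixity check: the item is read in full and its checksum is compared with the value recorded at deposit. The Python sketch below assumes SHA-256 checksums and hypothetical function names; the plan does not state which mechanism the repository actually uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Read the whole file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_recovery(path: Path, expected_sha256: str) -> bool:
    """Return True if the item cannot be read or its checksum has changed."""
    try:
        return sha256_of(path) != expected_sha256
    except OSError:
        # Missing or unreadable files are treated as needing recovery.
        return True
```

A scheduled job could call `needs_recovery` for each deposited item with the checksum stored at deposit time and trigger the recovery procedure for any item that returns True. Note that a matching checksum shows only that the bytes are intact; whether the format is still readable with existing software has to be checked separately.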

4. References