Requirements Documentation DRAFT

This document presents the requirements for the development project of AAI for Finnish language resources.



CO Copyright owner
CP Content provider, who acquires linguistic resources and sufficient rights to use them from the Copyright owner.
Database MySQL Language Bank database; all later references to "the database" in this document mean this MySQL database
HAKA Identity federation of the Finnish universities, polytechnics and research institutions
IdF Identity federation
IdP Identity provider
LRT Language resources and technology
SP Service provider
SUI New Scientist's interface

Common features for Automatic and Controlled authorization

Shibboleth authentication

Shibboleth authentication means here the HAKA authentication for users of Finnish universities, polytechnics and research institutions. Shibboleth authentication is a common feature preceding both the Automatic and Controlled authorization.

  • Available linguistic research resources (current www location)
  • HAKA as Identity Provider Federation (Haka pages)
  • HAKA login (WAYF Service, later to be replaced by Shibboleth2 Discovery Service)
  • Provided attributes (funetEduPerson schema)
  • In the CLARIN community, the ePPN attribute is currently seen as the minimum necessary attribute. The rest depends on how well the attribute sets and their semantics can be harmonized, something we hope will happen via the eduGAIN 3.0 project.



Linguistic resources (corpora) have to be equipped with access information divided into three categories:

(1) LRT which can be freely used by anyone (including resources with open licenses such as Open Access etc.). Whether any resources will fall into this category must be studied. LRT to which the CP can grant access automatically if the user has an affiliation with an IdP.

(2) LRT to which the CP can grant access automatically once the user accepts the Terms and Conditions attached to the corpus/resource. Failure to accept the Terms and Conditions stops the resource access process - One-sided: commitment by user to predetermined CP terms. See Automatic authorization

(3) LRT which can only be accessed according to an individual application by the user and after (any) individual consideration by the CP - Two-sided: commitment by user to terms and permission by CP. See Controlled authorization

Terms and Conditions

(a) Terms of Access

A description of all the requirements that the applicant has to satisfy in order to gain access

(b) Terms of Use: Code-of-Conduct/License Agreement

Here the alternatives are either some of a very few general research purpose EULAs that the applicant might already have signed, and which will apply for the resource in question, or a resource-specific license agreement that the CP provides

Must be specified later.

Language selection: Finnish/English

  • There are three different electronic application forms, both in English and Finnish.
  • Emails will be bilingual (English and Finnish).
  • CSC will implement an architecture that will support the addition of more languages in the future in a relatively easy manner.

Loading linguistic resources

The process by which the Language Bank Administrator adds resources to a server will be specified later. Whether the CP or other people can upload resources must be studied, because there may be safety and copyright considerations.

Monitoring and statistics

CSC will monitor and gather usage statistics.

Automatic authorization

Figure: User process for linguistics with Shibboleth authentication and automatic authorization

  • After Shibboleth authentication (Shibboleth authentication at the user's home organization is required)
  • Available linguistic resources of Category 2 with limitations defined by the CPs (for example, the resource can be accessible to users from University A only). See Resource Manager documentation below. How will the user view a list of resources before/after being authorized?
  • Acceptance by the user of Terms and Conditions attached to the resource is required. Whether automatic access on the basis of university affiliation alone may be given for research and education purposes must be studied.
  • After being authorized, the user can download the resources of Category 2 and save them to his/her own machine, but cannot execute applications (Lemmie or DMA) or log in to CSC's computing environment.
  • User information will not need to be saved in the database nor will the user need a CSC user account.

Resource Manager Documentation

The term Resource Manager is used for compatibility with the CLARIN Language Resource and Technology Federation document, in which topic 5 (Requirements) leaves resource management to the centers. The Resource Manager is the authorization component that automatically allows or denies access to files according to user attributes. A linguistic resource or corpus can contain one or several files.

  • Required attributes (per resource or file) must still be determined.


In the Demo, access control was implemented with a relational database that stores stat and hash data of files. The URL of the demo is not included here. The MySQL Language Bank database will be the actual platform.

The source code of the demo is attached:

  • dl: Python program to download files
  • list: Python program to show the allowed files

The current database structure includes a table called resurssi > resourcedetails:

describe resurssi;
| Field      | Type                  | Null | Key | Default | Extra |
| path_hash  | varchar(32)           | NO   | PRI |         |       | 
| path       | text                  | YES  |     | NULL    |       | 
| path_utf8  | text                  | YES  |     | NULL    |       | 
| moderator  | varchar(64)           | YES  |     | NULL    |       | 
| right_type | mediumint(8) unsigned | YES  |     | NULL    |       | 
| rights     | varchar(255)          | YES  |     | NULL    |       | 

Each record in the resurssi table contains a file description. A resource can contain one or several files.
The path_hash is an index and ensures the security of the demo system. It's generated by a python md5 object by the command, where realname is realpath(join(root, name)).
The moderator is Shibboleth EPPN (EduPersonPrincipalName). The moderator can set the rights.
Only path information is shown to the user.
Only right_type 0 is used.
The rights field contains a Shibboleth attribute key value string. The rights field can contain one of the following sample strings :
The program list only shows the user the files that the user has the right to access. The list of files has links to the dl program, which can send the requested file to the user if the rights allow sending.
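
The path_hash generation and the rights check described above can be sketched as follows. This is only an illustration, not the demo's actual source: the exact md5 command is not reproduced in this document, so hexdigest() is an assumption (it does yield the 32 characters that fit the varchar(32) path_hash column). The "key=value" layout of the rights string, the function names path_hash and is_allowed, and the attribute name in the usage example are likewise hypothetical.

```python
import hashlib
from os.path import join, realpath


def path_hash(root, name):
    """Return a 32-character hex MD5 digest of the file's real path.

    Assumption: the demo hashes the resolved path string; the exact
    command is not given in the requirements text.
    """
    realname = realpath(join(root, name))
    return hashlib.md5(realname.encode("utf-8")).hexdigest()


def is_allowed(rights, user_attributes):
    """Check one rights string against the user's Shibboleth attributes.

    Assumption: the rights field holds a single 'key=value' pair
    (right_type 0). Anything malformed is denied.
    """
    key, sep, value = rights.partition("=")
    if not sep:
        return False  # malformed rights string: deny by default
    return user_attributes.get(key.strip()) == value.strip()
```

For example, is_allowed("schacHomeOrganization=helsinki.fi", attrs) would be true only for a user whose attributes carry that organization value. Because realpath resolves ".." components and symbolic links, two spellings of the same path map to the same hash.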

Required features for production

We recommend that the Resource Manager model described as the Demo be chosen for production to grant automatic authorization to the chosen resources. In addition to the features of the Demo, the following features are needed for production:

Option for setting rights (to be specified)

  • CP sees a list of all of his/her files/resources.
  • Moderator (e.g. the Language Bank Administrator) can edit the rights field on behalf of the CP. This should include some (limited) prescribed usage right types, see Resource categories. Moreover, this option should allow the specific terms of usage to be deposited, which the applicant may sign electronically.

Other required features for production

  • MySQL Language Bank database will be used instead of the current demo database.
  • Adding AND and OR operations for the rights. This may be implemented as a new right_type 1 or by adding parsing to right_type 0, possibly in the new MySQL Language Bank database. This will be specified later.
  • Careful planning of the database structure.
  • Recursive views and functionality per resource for all subdirectories and files under them, like Unix chmod -R. Is there a need to deny access to certain files within a resource?
  • Showing the user the file sizes and adding the size information into the database.
  • Planning and implementing an interface for adding resources to a server. This may be a command-line program, because linguistic resources are static.
  • Usage statistics (the httpd server logs are already published with Analog, but is that enough?).
  • Groups. Groups are functionally equal to an OR operation over a list of users, but long lists are more efficient and user-friendly when stored in their own tables.
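
One possible shape for the AND/OR operations and groups listed above is sketched here. Nothing in it is specified yet: the expression syntax ("k=v AND k=v OR k=v", with AND binding tighter than OR), the reserved key "group", the attribute key "eppn" and the function names are all assumptions made for illustration.

```python
def parse_rights(expr):
    """Parse 'k=v AND k=v OR k=v' into OR-alternatives of (key, value) terms."""
    alternatives = []
    for alternative in expr.split(" OR "):
        terms = []
        for term in alternative.split(" AND "):
            key, sep, value = term.partition("=")
            if not sep:
                raise ValueError("malformed term: %r" % term)
            terms.append((key.strip(), value.strip()))
        alternatives.append(terms)
    return alternatives


def evaluate(expr, attrs, groups=None):
    """Return True if the user's attributes satisfy the rights expression.

    groups maps a group name to a set of ePPNs, standing in for a
    separate group table (hypothetical).
    """
    groups = groups or {}

    def term_ok(key, value):
        if key == "group":  # membership lookup instead of a long OR list
            return attrs.get("eppn") in groups.get(value, set())
        return attrs.get(key) == value

    return any(all(term_ok(k, v) for k, v in alternative)
               for alternative in parse_rights(expr))
```

A term such as group=corpus_team replaces a long chain of eppn=... OR eppn=... terms, which is the efficiency argument made for groups above.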

Controlled authorization

Figure: User process for linguistics with Shibboleth authentication, electronic applications and referees

Electronic application form processing

There are three different electronic application forms, both in English and Finnish. There are also forms for CSC's internal use to follow the application workflow status. Each form requires a program to handle it. Also the email responses require handling.

Commercial users need to contact CSC sales and sign a contract to access the resources. In the Language Bank the following types of licenses are currently available: A License (Academic License) and B License (Extended Commercial License).

After the Shibboleth authentication:

  • Electronic Application Form (as Shibboleth authenticated and prefilled)
    • Required attributes
  • Available linguistic research resources (with limitations defined by the CPs) How will the user view a list of resources before/after being authenticated and authorized?
  • Acceptance by the user of Terms and Conditions attached to the resource is required.
  • Send

If the user cannot be authenticated with Shibboleth:

If the user already has a CSC account, after logging onto the CSC Scientist's Interface:

  • Electronic Application Form for CSC users
  • Personal and project information update (if needed)
  • Available linguistic research resources (with limitations defined by the CPs)
  • Acceptance by the user of Terms and Conditions attached to the resource is required.
  • Send

After Send:

The electronic application form can also be used to collect new referee information. Each electronic application form can contain a checkbox for the referee candidate to express his or her willingness to function as a referee. It will be clearly indicated on the web page whether user access or referee promotion is being applied for.

Referee's authorization

An applying user becomes trusted by being approved by a referee. A new electronic form with a referee list is needed in English and Finnish. The form requires a program to handle it. The response to the email sent by the system will be via a web form (not by replying to the mail).

The referee's procedure to authorize and authenticate an applying user could be the following:

  1. The user is forwarded to the Referees List Form containing a list of referees ordered by country (ref. Referees table). Some applications may skip the referee procedure.
  2. If the user expects that a referee knows him or her, he or she selects that referee. A notification of the application will be sent by email to the referee with links for recommending and denying and, if the user cannot be authenticated via Shibboleth, for authenticating the user. A timer process has to be initiated when the email is sent to the referee.
    • The referee candidates select a referee, too.
  3. If the referee recommends that the application be accepted (ref. Recommend and Deny), the application will be forwarded to the CP and the Language Bank administrator to be accepted.
    • If the user does not know any referee, the application will be forwarded straight to the CP (or contact person) of the corpus and the Language Bank administrator.
    • If the referee selects the deny link, a rejection message will be sent to the CP and the administrator.

In this model, a referee losing status would affect the associated users as well. Loss of status for natural reasons (e.g. retirement, transition) does not have this effect.

Authentication of the user by a referee

A referee can authenticate an applying user if the user cannot be authenticated via Shibboleth.

The referee should undertake to accept responsibility for an applicant by first agreeing to, for example, the statement given below when authenticating the applicant: "I (the referee) confirm that I have satisfied myself as to the identity of the applicant by checking his/her official identification (photo id)."

Recommend and Deny

For the referee recommendation, the system has a secret passphrase which is SHA-hashed with the applid value. There are two web programs: Recommend and Deny. The email message generated for the referee contains links to both of them together with the application data. When the referee recommends that the application be accepted, he or she clicks the recommend link that is parametrized with an applid and a SHA hash value, and the hash will be checked.

When the hash matches, the recommend program increments the CAC field value by 32.

If the referee fails to reply within, e.g., one week, he or she will receive a reminder. If the referee still fails to reply, the application will be forwarded to the CP and the administrator after a predefined delay (e.g. one week).

If the hash does not match, the programs do nothing or warn the staff about abuse.
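
The link check used by the Recommend and Deny programs could look roughly like the sketch below. The text only says "SHA-hashed"; SHA-256, the concatenation order, the placeholder passphrase and the function names are assumptions. The CAC update uses a bitwise OR rather than a plain addition so that clicking the link twice does not add 32 twice, which is also an assumption about the intended behaviour.

```python
import hashlib
import hmac

SECRET_PASSPHRASE = b"change-me"  # hypothetical placeholder, never hard-code in production
REFEREE_RECOMMENDATION = 32       # CAC value for a referee recommendation


def link_hash(applid, passphrase=SECRET_PASSPHRASE):
    """Hash the secret passphrase together with the applid (SHA-256 assumed)."""
    return hashlib.sha256(passphrase + str(applid).encode("ascii")).hexdigest()


def verify_link(applid, supplied_hash, passphrase=SECRET_PASSPHRASE):
    """Compare the hash from the link with the expected one in constant time."""
    expected = link_hash(applid, passphrase).encode("ascii")
    return hmac.compare_digest(expected, supplied_hash.encode("utf-8"))


def recommend(application, applid, supplied_hash):
    """Set the referee bit in the CAC field if the link hash matches."""
    if not verify_link(applid, supplied_hash):
        return False  # do nothing, or warn the staff about abuse
    application["CAC"] |= REFEREE_RECOMMENDATION
    return True
```

A keyed HMAC (Python's hmac module) is the more conventional construction for this kind of signed link and might be worth considering when the design is finalized.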

Emails of the referee procedure

  1. Referee Form sends an email to the referee for recommending or denying.
  2. Reminder email to the referee, if (s)he fails to reply (automatically after a delay).
  3. Referee's Recommend email to CP.
  4. Referee's Recommend email to administrator.
  5. Referee's Deny email to CP.
  6. Referee's Deny email to administrator.
  7. Referee's No reply email to CP (automatically after a delay).
  8. Referee's No reply email to administrator (automatically after a delay).

In the case of a referee candidate, emails to the CP will be replaced by emails to the nominator from the Helsinki University Department of General Linguistics.


A timer process has to be initiated when the email is sent to the referee. Time limits can be adjusted as desired.

  • reminder, if the referee has not answered in a certain time: 8 days
  • timer expires after a delay: 15 days (an email will be sent to the CP)
  • timer will be cancelled if the referee sends Recommend or Deny
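
The timer rules above can be expressed as a small decision function that a periodic job (for example a daily cron run, which is an assumption) could call for every open application. The function name, its arguments and the returned action labels are hypothetical; only the 8-day and 15-day limits come from the list above.

```python
from datetime import datetime, timedelta

REMINDER_AFTER = timedelta(days=8)   # adjustable time limit
EXPIRE_AFTER = timedelta(days=15)    # adjustable time limit


def timer_action(sent_at, now, answered, reminder_sent):
    """Decide what the timer process should do for one application."""
    if answered:
        return "cancel"          # referee sent Recommend or Deny
    elapsed = now - sent_at
    if elapsed >= EXPIRE_AFTER:
        return "forward_to_cp"   # no reply: notify the CP
    if elapsed >= REMINDER_AFTER and not reminder_sent:
        return "send_reminder"
    return "wait"
```

Keeping the decision free of side effects makes the limits easy to test and to adjust; sending the emails would be the caller's job.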

Web forms (AA work flow)

The web application can process web form submissions. These web forms could include a text box field (size to be decided later) if the referee needs to provide comments, and also a checkbox field if there is a need to flag that the candidate’s application should no longer be processed automatically and should be subject to a further administrator decision. (Using web forms should eliminate the need to deal with spam email, because no email addresses are exposed in the application.) (A CAPTCHA test could also be used on the web form.)

CP's and administrator's acceptance

If both the CP and the administrator accept the application (ref. Accept and Reject), the user will receive access with the required permissions. Even if the referee has denied the application, the administrator retains the option to accept it, provided the CP agrees.

After the CP's and administrator's acceptance, all information will automatically be copied to the database tables User, Address etc. The CSC user manager process will create a new CSC user account with the appropriate rights and associate the new customer with a new or existing project. CSC's current Unix/Linux-based environment uses Unix groups for user management (e.g. for Lemmie and DMA). The CSC user account allows command-line access to a server. Opening a normal CSC user account would also offer tools for monitoring. If the IdM system is running, it could create the account.

If the user's home organization is a member of HAKA, he or she can log onto the CSC Scientist's Interface using the username and password issued by the home organization. During the first visit the user is also asked for the CSC user account, so that the user's ePPN can be linked to it. After that, the CSC user account is no longer needed to log onto the CSC Scientist's Interface: the user can use either a HAKA login or a CSC user account login to access the resources.

The referees will be nominated by the Helsinki University Department of General Linguistics and the Administrator.

Accept and Reject

The program then sends the application by email to the CP (or contact person) of the corpus and the Language Bank administrator to be accepted. If both accept, the Accept program copies the application data into the database tables kayttajat (users), osoitteet (address) etc., and sends an acceptance email to the user.

What else does the Reject program do other than send a rejection email to the user? Will the application be deleted?

Emails of the CP's and administrator's procedure

  1. CP's Accept email to administrator.
  2. Administrator's Accept email to (save the user's data in the database).
  3. Accept email to user.
  4. CP's Reject email to administrator.
  5. Administrator's Reject email to user.

In the case of referee candidates, CP's emails will be replaced by emails of the nominator from the Helsinki University Department of General Linguistics, who accepts new referees.


A timer process for the CP and administrator has to be initiated after the referee's response. Time limits can be adjusted as desired.

  • reminder, if the CP or administrator has not answered in a certain time: 8 days
  • timer expires after a delay: 15 days (a Reject email will be sent to the administrator)
  • timer will be cancelled if the CP or administrator sends Accept or Reject

Database changes

These changes will be made in the MySQL Language Bank database.

Email confirmation field

The application table has an email confirmation field, which contains at least 128 bits of random data generated when the application form is stored. When the application form is submitted by a non-registered user, the random data value will be emailed to the user. The user will receive a link to the confirmation form, where he or she needs to confirm the email address by entering the random data value (refer to the KITWIKI registration). Submitting the confirmation form increments the CAC field value of the application table by 1 or 2 depending on the email address.

Applications that remain unconfirmed, for example because the user's email address is invalid, will be dropped from the database by a job that runs once a day.
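
A minimal sketch of generating and checking the confirmation value, assuming a hex encoding and Python's secrets module (neither is prescribed by this document; the function names are hypothetical). Note that 128 bits in hex take 32 characters, so with this encoding the emailconfirmation column would need to be wider than the 20 characters listed in the application table.

```python
import hmac
import secrets


def new_confirmation_token():
    """Return 128 bits of random data as a 32-character hex string."""
    return secrets.token_hex(16)  # 16 bytes = 128 bits


def confirm(stored_token, supplied_token):
    """Check the value typed into the confirmation form in constant time."""
    supplied = supplied_token.strip().encode("utf-8")
    return hmac.compare_digest(stored_token.encode("utf-8"), supplied)
```

On a successful confirm, the caller would then raise the CAC value by 1 or 2 as described above.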

CSC Authentication Classes (CAC field)

The CAC field in the application table describes how the user's identity is verified. Information on how each user is authenticated needs to be stored in the database, because stronger authentication than the currently used personal signature may be required. It should be added into the database table kayttajat (users).

  • The minimum level of trust for authentication (expressed by CAC values) is 32.
  • Required CAC values per resource have to be defined.

The CAC field can take one or several of the values listed below. If several values apply, they are summed.

  • 0. Not authenticated (data stored from web form).
  • 1. User-verified email. Authentication by an email confirmation from any address.
  • 2. Organization-verified email. Authentication by an email confirmation from a well-known CSC customer organization. HAKA members and state institutions can be considered as well-known CSC customer organizations.
  • 4. Authentication using a credit card or a good certificate issued by well-known CA.
  • 8. Scanned signature in a pdf-document.
  • 16. Personal signature (default value for current CSC customers).
  • 32. Referee recommendation: a known professor or research director recommends that the application be accepted. In addition, official identification (photo ID) can be verified by a referee.
  • 64. Strong authentication using SAML2/Shibboleth or grid certificates (in the USA: urn:mace:incommon:iap:bronze).
  • 128. Official identification verified by a bank account (tupas) or more secure certificates (in the USA: urn:mace:incommon:iap:silver).
  • 256. CSC-checked official identification card or passport.
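
Because every CAC value is a distinct power of two, the summed field behaves like a set of bit flags. The sketch below models that with Python's IntFlag; the member names are invented for illustration, and interpreting "minimum level of trust is 32" as a numeric comparison is an assumption (it could also mean that a particular flag must be set).

```python
from enum import IntFlag


class CAC(IntFlag):
    """CSC Authentication Classes as combinable flags (names are hypothetical)."""
    NOT_AUTHENTICATED = 0
    USER_EMAIL = 1
    ORGANIZATION_EMAIL = 2
    CARD_OR_CERTIFICATE = 4
    SCANNED_SIGNATURE = 8
    PERSONAL_SIGNATURE = 16
    REFEREE_RECOMMENDATION = 32
    SHIBBOLETH_OR_GRID = 64
    BANK_VERIFIED_ID = 128
    CSC_CHECKED_ID = 256


MINIMUM_TRUST = 32


def meets_minimum(cac, minimum=MINIMUM_TRUST):
    """Assumed reading of the rule: the summed value must reach the minimum."""
    return int(cac) >= minimum
```

For instance, a user with a verified email and a referee recommendation has CAC.USER_EMAIL | CAC.REFEREE_RECOMMENDATION, stored as 33. With all flags set the value is 511, which fits the smallint column proposed for the CAC field.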

Application table

When the user sends the application, what should be done with application data that has not yet been accepted? It can be stored in the existing tables with new status fields, or new table(s) can be created. We recommend that a new application table be created.

| Field              | Type     | Size | Null | Comment                                 |
| applid             | int      |      | no   |                                         |
| arrivaldate        | date     |      | no   |                                         |
| usernamecantidate  | varchar  | 8    | yes  |                                         |
| CAC                | smallint |      | no   |                                         |
| display name       | varchar  | 20   | no   |                                         |
| familyname         | varchar  | 25   | no   |                                         |
| nationality        | smallint |      | no   | phone code or TLD                       |
| position           | varchar  | 40   | no   |                                         |
| organization       | varchar  | 40   | no   |                                         |
| faculty            | varchar  | 40   | yes  |                                         |
| phone              | varchar  | 20   | yes  |                                         |
| gsm                | varchar  | 20   | yes  |                                         |
| email              | varchar  | 60   | no   |                                         |
| emailconfirmation  | varchar  | 20   | yes  | only needed during confirmation process |
| referee            | smallint |      | yes  |                                         |
| datetime           | datetime |      | no   |                                         |
| projectname        | varchar  |      | yes  |                                         |
| projectdescription | text     |      | yes  |                                         |
| newreferee         | char     | 1    | yes  |                                         |

  • A postal address is required for sending the password, magazines and Christmas cards. Fields must be rechecked.
  • Will the applied resources be stored here?

Referees table

CSC has to add the new table referees in the database. The referee table must have the ID and status fields. The ID field is just a number which connects the table to the henkilo (person) table (includes e.g. first name and last name) and to the osoitteet (address) table (includes e.g. email, phone etc.). The status field can have the values 0 (no longer trusted), 1 (active) and 2 (retired).

It is necessary to document who was the referee for each user. The referees table needs to be connected to the kayttajat (users) table by adding the ID field of the referee table into the kayttajat table.

| Field  | Type     | Size | Null |
| ID     | smallint |      | no   |
| status | char     | 1    | no   |

Topic attachments
linguistics_user_controlled_process_draft.png (126.3 K, 2009-05-28 16:20): Controlled authorization drawing
Topic revision: r28 - 2009-05-29 - SatuTorikka