White paper

Current models

The Language Bank of Finland

User applications to the Language Bank of Finland are delivered via a web-based application form. Upon submission, the form sends the application to the Language Bank administrator by email.

Many corpora further require a personal permission from their owner. A single person can be responsible for a single language or even only for a single text within the language. The application form also sends an email copy of the application to the owner (or contact person) of the corpus the user has expressed interest in.

If both the administrator and the corpus owner accept the application, the administrator requests the CSC user manager to add the user to the group granting the required permissions. If the application is rejected, the administrator informs the user personally. As the number of corpora grows, the application form may have to be split into two or more parts.

CSC user account management

After the web-based application is approved, the user must sign and send a paper form. The signed paper form is required to authenticate the user. After receiving it, the CSC user manager opens a new CSC user account with the appropriate rights for the user. User accounts are administered manually by CSC. User information is stored in CSC's customer database Askare.

Users gain access to the linguistic resources via CSC's web interface, the Scientist's interface, or by logging on to a UNIX server where the resources are located. Resources represent sets of corpora or linguistic programs. In this context, users are the end-users of the resources. Technically, each resource corresponds to a UNIX group. Each user can be a member of several groups, but each resource has only one group. The user may view and edit his or her personal information via the Scientist's interface.
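The group-based model described above can be sketched in a few lines of Python. This is only an illustration: the resource names, group names, user names and the can_access helper are all hypothetical and do not describe CSC's actual configuration.

```python
# Hypothetical data: each resource maps to exactly one UNIX group,
# while a user may belong to several groups.
RESOURCE_GROUP = {
    "corpus_a": "grp_corpus_a",
    "parser_tools": "grp_tools",
}

USER_GROUPS = {
    "alice": {"grp_corpus_a", "grp_tools"},
    "bob": {"grp_tools"},
}


def can_access(user: str, resource: str) -> bool:
    """Return True if the user is a member of the resource's single group."""
    group = RESOURCE_GROUP.get(resource)
    if group is None:
        return False  # unknown resource: deny by default
    return group in USER_GROUPS.get(user, set())
```

Because a resource has only one group, granting a user access to a new resource amounts to adding that user to one more group.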


As the research network expands, previously used authentication methods become inadequate. The systems currently in use within single countries are not, as such, suitable for international purposes. Nor is it possible for administrators to reliably assess and monitor the users. Individual countries also have their own conventions and methods, making their systems incompatible with each other.

The growing number of users with various needs puts pressure on automatisation of user processes. Automatisation of user processes can raise the quality of service experienced by the users, save money by eliminating duplicate work, decrease the possibility of human errors, increase safety and give better tools for monitoring. CSC has already made plans for new web-based application forms and electronic user processes. CSC has purchased the Sun Identity Manager (IdM) software for identity management.

AAI Technologies

The currently available electronic technologies for Authentication and Authorisation Infrastructures (AAI) in which CSC is involved are presented here: Certificates, SAML2/Shibboleth and eduGAIN. Detailed information about the various AAI technologies can be found in the Report on comparison and assessment of eID management solutions interoperability.


Certificates can be used to identify a person or a server. A personal certificate is a file that uniquely identifies its owner. The certificates contain information identifying the owner of the certificate, the public key itself, the expiration date of the certificate, the name of the CA (certification authority) that signed the certificate and some other data. Server certificates are widely used to identify servers, e.g. Shibboleth or any SSL/TLS (HTTPS) servers.

The grids are based on X.509 public key certificates as described in RFC 3280. Some countries and universities have very secure identification cards with smart cards, but because they require a special smart card reader, they are widely used only in Estonia.


  • available to everybody for free (grid organizations, http://www.cacert.org/)
  • in commercial supply reasonably priced (for example, 1,5 € from TeliaSonera for members of the Funet)


  • not widely used
  • contain only name and email information
  • difficult infrastructure, user certificates only used by the grid
  • difficult to use, one or two more passwords or pins
  • difficult to program: the most common C/Python application programming interface, OpenSSL, is not well documented
  • hardware ones are expensive (40 €) and require a reader (Windows-only driver) with about the same cost
  • trust issues

The organization accepting Public Key certificates needs a trust policy. The EUGridPMA (Policy Management Authority) is the international organization coordinating the trust fabric for e-Science Grid authentication within Europe. This table lists all members of the EUGridPMA. There is also a map.

TERENA also has a repository of verified root CA certificates called TACAR (TERENA Academic CA Repository). Note that neither Finnish operators nor NorduGrid, which is used for Nordic grid operations, are present. It will be very difficult to get TACAR-accepted certificates in Finland. The certificates listed in TACAR are not known by browsers by default, which makes them totally unsuitable for web usage. CSC and Haka use TeliaSonera certificates, because they are known by most browsers and using them costs much less than setting up a separate Public Key Infrastructure.

Commercial certificates based on credit cards may not be very reliable, because there is some evidence of a black market for stolen credit cards.

SAML2/Shibboleth federation

SAML (Security Assertion Markup Language) is an OASIS standard for exchanging authentication, access rights and attribute information in XML. Shibboleth is a SAML-based, open source software package for web single sign-on across or within organizational boundaries. Shibboleth version 2 is directly compatible with SAML version 2. A user is authenticated with his or her organizational credentials, so a single password suffices for multiple applications. In addition to providing single sign-on functionality, Shibboleth can control access to licensed resources.

Shibboleth has two major halves: an identity provider (IdP) that authenticates users and releases selected information about them and a service provider (SP) that accepts and processes the user data before making access control decisions or passing the information to protected applications. These entities trust each other to properly safeguard user data and sensitive resources.

The SP runs in Apache as a module or in IIS as a filter. The IdP is a web service written in Java and operates in any standard servlet container like Tomcat, which is a freely distributed www server. Unlike the better-known Apache httpd server, Tomcat is based on Java and implements the Java Servlet and JavaServer Pages definitions.

In addition, Shibboleth contains the WAYF (Where Are You From) server component. After connecting to a resource in order to gain access to it, the user will be redirected to the WAYF server to be authenticated at the user's home organization. The role of the WAYF server is to present a list of home organizations to the user. The user selects his home organization and is redirected to its login page. If the user is already authenticated at his home organization, he does not have to be reauthenticated. An example: HAKA WAYF server page.
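The role of the WAYF server described above can be illustrated with a small Python sketch. It is not the Shibboleth WAYF protocol itself: the organization names, the login URLs and the target query parameter are hypothetical placeholders chosen only to show the two steps (list the home organizations, then redirect to the selected one).

```python
from urllib.parse import urlencode

# Hypothetical home organizations and their login pages.
HOME_ORGANIZATIONS = {
    "University A": "https://idp.university-a.example/login",
    "University B": "https://idp.university-b.example/login",
}


def list_home_organizations() -> list[str]:
    """Return the names shown to the user on the WAYF page."""
    return sorted(HOME_ORGANIZATIONS)


def redirect_to_home_organization(selected: str, resource_url: str) -> str:
    """Build the URL of the selected login page, remembering the resource."""
    login_page = HOME_ORGANIZATIONS[selected]
    # "target" is an assumed parameter name for the resource to return to.
    return login_page + "?" + urlencode({"target": resource_url})
```

For example, redirect_to_home_organization("University A", "https://resource.example/corpus") yields the University A login URL with the resource address carried along as a query parameter.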

For a series of technical explanations on how Shibboleth works, from easy to expert, refer to SWITCH Demo.

Shibboleth releases user information in the form of attributes. User attributes in the Haka federation are provided according to the funetEduPerson schema and test service. The funetEduPerson schema is compatible with the SCHAC schema.

The mandatory attributes are:

  • cn, commonName, displayName + sn
  • sn, surName, family name
  • displayName, used givenName, the name the individual has registered as the one (s)he uses
  • eduPersonPrincipalName, should be represented in the form user@scope, where scope defines a local security domain
  • schacHomeOrganization, specifies a person's home organization using the domain name of the organization.
  • schacHomeOrganizationType, countrycode:string, fi:university
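A service receiving these attributes might check that all mandatory ones are present before using them. The following Python sketch assumes the attributes arrive as a plain dictionary; the function names are hypothetical and the check on eduPersonPrincipalName only enforces the user@scope form mentioned above.

```python
# Mandatory attribute names as listed above.
MANDATORY_ATTRIBUTES = (
    "cn",
    "sn",
    "displayName",
    "eduPersonPrincipalName",
    "schacHomeOrganization",
    "schacHomeOrganizationType",
)


def missing_attributes(attributes: dict[str, str]) -> list[str]:
    """Return the mandatory attributes that are absent or empty."""
    return [name for name in MANDATORY_ATTRIBUTES if not attributes.get(name)]


def split_principal_name(eppn: str) -> tuple[str, str]:
    """Split a value of the form user@scope into (user, scope)."""
    user, separator, scope = eppn.partition("@")
    if not separator or not user or not scope or "@" in scope:
        raise ValueError(f"not of the form user@scope: {eppn!r}")
    return user, scope
```

A form handler could, for instance, refuse to prefill an application when missing_attributes returns a non-empty list.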

The schema also contains the funetEduPersonProgram, which is an educational degree program. Privacy is guaranteed by controlled Attribute Release Policies for the Haka federation. Adding new mandatory attributes is a very difficult and slow process and therefore only recommended in cases where additional information is truly vital.

Haka is the identity federation of the Finnish universities, polytechnics and research institutions. Haka uses SAML2/Shibboleth technology. The federation consists of the users' home organizations providing the user identities according to common rules and contracts. The Haka federation is operated by CSC, the Finnish IT center for science. Haka has over 20 home organizations and about 50 services. A similar Higher Education federation in the USA is InCommon, which has an annual fee of $1000. Although the software encourages the federation model, bilateral agreements are, of course, possible.

The list of the federations and their status.

Kalmar Union is a SAML2 project that will connect the Nordic countries' academic communities to establish a Nordic cross-federation. Haka will most probably join the Kalmar Union in spring 2009, when it is scheduled to be in operation.

CSC could easily integrate the Haka authentication in the electronic application form, where the application form would serve as a Service Provider.


The purpose of eduGAIN is to provide the means for achieving interoperation between different Authentication and Authorisation Infrastructures (AAI).

There are a number of AAI systems developed and used on the national (NREN, National Research and Education Network) level. Shibboleth (Internet2) is the federation technology used in the US, Switzerland, Finland, Germany, Great Britain, Hungary and Greece (under development). PAPI is used in Spain, A-Select in The Netherlands, simpleSAMLphp in Norway. There is also a RADIUS-based AAI used in Croatia.

In order to be granted access to protected resources and services from other federations, the users first need to be successfully authenticated by their home AAI and authorized by the visited Service Provider (usually based on attributes expressing a special role for the user). Some national systems provide few or no attributes. eduGAIN provides the technology necessary for carrying out these steps and thus interconnecting different AAI systems.

eduGAIN is not yet a production-level service. The software is at the second release candidate level, and version 1.0 will be published soon. Only pilot projects will be possible before April 2010, when policy development is scheduled to be ready.

CSC has proposed to allocate ½ FTE per year in the upcoming GN3 project to work on Service Action 3 Task 4 of eduGAIN. Haka specialists would like to continue contributing as they have done in GN2. The GN3 project proposal is still being edited, and the project is planned to start in April 2009.

Data access model

Shibboleth will be used to authenticate users of the Language Bank of Finland. Access to the resources will be granted with a single authentication transaction by providing username and password. The resources represent sets of corpora, linguistic programs or other permissions.

After the user connects to a desired resource, (s)he can access the resource automatically if (s)he belongs to a group that is allowed to access the named resource. If the owner of the resource has set limits for its use, the user can continue to the owner-controlled application procedure. A list of the available linguistic research resources will be displayed to the user, from which (s)he may select one or several. If the user cannot be authenticated with Shibboleth, (s)he can make a contract to access the linguistic services.

Automatic access to resources

Automatic access means here an attribute-based access to a chosen resource. After successful authentication at the user's home organization, the resource decides on granting or denying access for the user. In the background, the home organization has provided minimal user attributes to the resource, which it requires for the access authorization decision and for delivering its service. If the user is already authenticated, (s)he may access this resource immediately.

For example, a researcher from University A can directly access the resource, provided that the resource is accessible to researchers from University A. SAML2/Shibboleth can act as an identity provider (IdP), which authenticates users and provides attributes.

The automatic access procedure can have the following Shibboleth attributes (example):

  • User Pekka is a researcher in the Huippu university.
  • He wants access to ResourceZ.
  • Pekka belongs to the group huippu.
  • His user attributes follow the format HTTP_SHIB_EP_PRINCIPALNAME, in this example HTTP_SHIB_EP_PRINCIPALNAME=pekka@huippu.fi.
  • The owner of ResourceZ has authorized the group huippu to access ResourceZ.
  • Pekka is granted access to ResourceZ.
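The example above can be expressed as a minimal Python sketch of an attribute-based access decision. It assumes the attribute reaches the application as the variable HTTP_SHIB_EP_PRINCIPALNAME, as in the example; how a group is derived from the attribute is not specified here, so the scope-to-group table, the resource table and the function name are hypothetical.

```python
# Hypothetical mapping from the scope of the principal name to a group.
SCOPE_TO_GROUP = {"huippu.fi": "huippu"}

# Hypothetical owner decisions: which groups may access which resource.
RESOURCE_ALLOWED_GROUPS = {"ResourceZ": {"huippu"}}


def is_access_granted(environ: dict[str, str], resource: str) -> bool:
    """Decide access from the principal name released by the home organization."""
    eppn = environ.get("HTTP_SHIB_EP_PRINCIPALNAME", "")
    _user, separator, scope = eppn.partition("@")
    if not separator:
        return False  # no usable attribute: treat the user as unauthenticated
    group = SCOPE_TO_GROUP.get(scope)
    if group is None:
        return False  # home organization not mapped to any group
    return group in RESOURCE_ALLOWED_GROUPS.get(resource, set())
```

With the values from the example, the environment {"HTTP_SHIB_EP_PRINCIPALNAME": "pekka@huippu.fi"} is granted access to ResourceZ, while a user from an unmapped scope is denied.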

User information will not need to be saved in the database nor will the user need a CSC user account. For monitoring, usage statistics may be generated.

Once a user is authenticated, (s)he can access any other Shibboleth-enabled resources without entering a login name and password again, provided that (s)he is authorized to access these resources. It is only necessary to log in again if the user exits the web browser or if no Shibboleth resource is accessed for some time.

Figure: User process for linguistics with Shibboleth authentication and automatic access to resources

Owner-controlled access to resources

If the owner of the resource (e.g. corpus owner) has set limits to the use of the resource, the user may have to apply for authorization to access the resource. Also in this case, Shibboleth will be used to authenticate users.

The owner-controlled application procedure for a new user could be the following:

  1. The user first selects one or several resources from a list of available linguistic research resources. The resources have limitations defined by their owners. After the user connects to a resource in order to gain access to it, (s)he will be redirected to his home organization to be authenticated. If the user is already authenticated, (s)he does not have to reauthenticate.
  2. After successful authentication, an electronic application form will open for applying for authorization to access the resource(s). In the background, the user's home organization has provided a set of attributes about him/her, which are used to prefill the form.
  3. The user then completes the application form to describe his/her needs. The purpose for which the resources are applied for serves as the basis for the authorization. After completing the application form, the user selects Send.

After the application is sent, it is forwarded for the referee's authorization (see below). Some applications may skip the referee process. The owner of the resource will finally decide on granting or denying access to the resource. If both the administrator and the owner accept the application, the user will be granted access with the required permissions.

The application data will need to be saved in order to forward it to referees and owners. For this purpose, saving the application data in the database is the best solution. Opening up a normal CSC user account would offer tools for monitoring.

Referee's authorization

An applying user becomes trusted by being approved by a referee. Referees will be nominated and referee lists maintained by the Helsinki University Department of General Linguistics. CSC's customer database Askare can store the referee lists. The members of the referee network share their knowledge about the users with each other in order to evaluate the incoming applications. Naturally, the referees need to be trusted as well. Information about the referee should be stored with the application data. In this model, a referee losing status would affect the associated users as well. Loss of status due to natural reasons (e.g. retirement, transition) does not have this effect. The referee can also identify the user if (s)he is not authenticated.

Referee's authorization procedure for a new user could be the following:

  1. The user is forwarded to a page containing a list of referees ordered by country: Does any referee know the user? Some applications may skip the referee procedure.
  2. If the user expects that one or two referees know him/her, (s)he selects that/those referee(s). The application will be sent by email to up to two referees with links for recommending and denying.
  3. If the referee recommends that the application be accepted, the application will be forwarded to the owner and the Language Bank administrator to be accepted. The referee can also identify the user if needed.
    • If the user does not know any referee, the application will be forwarded straight to the owner (or contact person) of the corpus and the Language Bank administrator.
    • If the referee selects the deny link, a rejection message will be sent to the administrator. Even if the referee rejects the application, the administrator retains the option to accept it, provided the owner agrees.
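The routing rules of the referee procedure can be sketched in Python as follows. This is a simplified illustration, not a specification: the names are hypothetical, and two behaviours that the procedure leaves open are assumptions here, namely that the application waits until every selected referee has answered and that one recommendation is enough to forward it.

```python
from enum import Enum


class Verdict(Enum):
    RECOMMEND = "recommend"
    DENY = "deny"


def next_recipients(referees: list[str], verdicts: dict[str, Verdict]) -> list[str]:
    """Return who should receive the application (or rejection message) next."""
    if len(referees) > 2:
        raise ValueError("an application is sent to at most two referees")
    if not referees:
        # The user knows no referee: go straight to owner and administrator.
        return ["owner", "administrator"]
    pending = [name for name in referees if name not in verdicts]
    if pending:
        return pending  # assumption: wait for all selected referees
    if any(verdicts[name] is Verdict.RECOMMEND for name in referees):
        return ["owner", "administrator"]  # assumption: one recommendation suffices
    return ["administrator"]  # all referees denied: rejection message


def access_granted(administrator_accepts: bool, owner_accepts: bool) -> bool:
    """Access requires acceptance by both the administrator and the owner."""
    return administrator_accepts and owner_accepts
```

Note that access_granted does not take the referee verdict as input, which reflects the rule that the administrator may still accept a refereed-down application if the owner agrees.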

The same application form can also be used to collect new referee information. The form should contain a check-box for the referee candidate to express his/her willingness to function as a referee.

Figure: User process for linguistics with Shibboleth authentication, electronic applications and referees

Non-registered users

If the user cannot be authenticated with Shibboleth, (s)he can fill in the electronic application form as non-registered to apply for authorization to access the resource.

In this case, the email address of the user needs to be verified. If the user is authorized to access the resource, the referee or the owner can authenticate the user. Commercial users need to contact the Sales organization and sign a contract to access resources.
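How the email address is verified is not specified here. One common approach, shown below purely as an assumed sketch, is to email the user a link containing a token that only the server can compute, and to accept the address once the user returns with a matching token. The function names and the use of an HMAC over the address are illustrative choices, not part of the planned system.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real service would store it persistently.
SERVER_SECRET = secrets.token_bytes(32)


def make_verification_token(email: str, secret: bytes = SERVER_SECRET) -> str:
    """Compute the token to embed in the verification link sent by email."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


def is_valid_token(email: str, token: str, secret: bytes = SERVER_SECRET) -> bool:
    """Check a returned token in constant time against the expected one."""
    expected = make_verification_token(email, secret)
    return hmac.compare_digest(expected, token)
```

A successful check only shows that the applicant can read mail sent to the address; identifying the person remains the task of the referee or the owner.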


The cost/usefulness diagram below compares available AAI technologies: Certificates, eduGAIN, SAML2/Shibboleth and Referees.


Figure: Technology cost/usefulness diagram (Pekka Järveläinen, CSC)

The cost consists of the estimated amount of work required to set up the production service. The usefulness comprises the quantity, quality and scope of the information available about the potential users. For example, the SAML2/Shibboleth technology provides a lot of high-quality information but only in the Finnish or Nordic scope, while eduGAIN hopefully covers most European academic users, although less information per user can be expected. For this reason, a rough guess can be made that SAML2/Shibboleth and eduGAIN are equally useful.

The usefulness of the referee process depends on the proportion of applications it covers: the majority of users must be known by the referees for the process to be useful. A well-organized group of referees will be very useful, and the technical requirements are quite low, consisting of only some web forms and data in databases.

Personal certificates will not be used.

Topic revision: r72 - 2008-09-15 - PekkaJarvelainen