HFST: Getting a corpus to Korp

This page describes how to install Korp on your own machine and how to convert a corpus into a format that the Korp tools understand.

You probably need:

  • Kielipankki-konversio: for converting your corpus to VRT format (if needed) and for converting the VRT corpus to a format usable by the Korp backend
  • Korp backend: for performing searches in the corpus
  • Korp frontend: for a graphical user interface that communicates with the Korp backend

Converting corpora to korp format with Kielipankki-konversio tools

See Kielipankki's and Språkbanken's instructions. Also see Kielipankki's technical instructions.

The Kielipankki tools can be fetched from their GitHub repo. They depend on the CWB tools (including cwb-perl), which must be installed first and made visible to the Kielipankki tools with

export CWB_BINDIR=insert_cwb_installation_dir_here
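
Before running the conversion scripts, it can help to confirm that the directory really contains the CWB executables. The following is a minimal sketch. The function name check_cwb_bindir and the list of tools it looks for (cwb-encode, cwb-makeall, cqp) are assumptions made for illustration, not a list taken from the Kielipankki tools.

```shell
# Check that a directory contains the CWB executables we expect.
# ASSUMPTION: the tool list below is illustrative; adjust it to what
# your conversion scripts actually call.
check_cwb_bindir() {
    dir=$1
    missing=0
    for tool in cwb-encode cwb-makeall cqp; do
        if [ ! -x "$dir/$tool" ]; then
            echo "missing or not executable: $dir/$tool" >&2
            missing=1
        fi
    done
    # 0 if every tool was found, 1 otherwise
    return "$missing"
}
```

After the export, it could be called as check_cwb_bindir "$CWB_BINDIR" && echo "CWB tools found".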

When compiling and installing CWB, set the variable PLATFORM in config.mk to unix.

See the CWB pages for how to install cwb-perl as well. In the directory /korp/cwb-perl/CWB, run

perl Makefile.PL --config /usr/local/cwb-3.4.12/bin/cwb-config

If Perl complains that HTML::Entities is missing, start cpan as superuser and execute

install HTML::Entities
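
To find out whether the module is already available before (or after) going through cpan, you can ask Perl to load it. This is a small sketch. The helper name have_perl_module is hypothetical, and it relies only on Perl's standard -M and -e options.

```shell
# Return 0 if the given Perl module can be loaded, non-zero otherwise.
# have_perl_module is a hypothetical helper name used for illustration.
have_perl_module() {
    # -MNAME loads the module; -e 1 runs a trivial program and exits.
    perl -M"$1" -e 1 2>/dev/null
}
```

For example, have_perl_module HTML::Entities || echo "HTML::Entities is not installed" reports the missing dependency without starting a build.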

NOTE: Spaces in attributes can cause problems, so avoid them. The characters & and < can also cause problems, so they must be escaped as described in Kielipankki's technical instructions.
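
As an illustration of the escaping step, the sketch below replaces & and < with the standard XML entities &amp; and &lt;. That these are the escapes Kielipankki's instructions call for is an assumption, so check the instructions before relying on it. The function name escape_vrt_text is hypothetical. Apply it to token or attribute text only, not to whole VRT lines, because it would also escape the < of structural tags.

```shell
# Escape & and < in text read from standard input.
# ASSUMPTION: standard XML entities are the intended escapes.
escape_vrt_text() {
    # & must be replaced first, otherwise the & of "&lt;" would be
    # escaped again. In a sed replacement, \& is a literal ampersand.
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g'
}
```

For example, printf '%s\n' 'AT&T <3' | escape_vrt_text prints AT&amp;T &lt;3.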

TODO: the following warnings should be handled, although they are not that dangerous:

korp-make-corpus-package.sh: Warning: Korp frontend directory not found
korp-make-corpus-package.sh: Warning: No readme file included
korp-make-corpus-package.sh: Warning: No documentation included
korp-make-corpus-package.sh: Warning: No conversion scripts included

The package is created in the pkgs/CORPUSNAME directory.

Korp backend

Both the Korp backend and the frontend are based on Språkbanken's Korp tools. Fetch the backend from Kielipankki's GitHub repo (private repository).

For dependencies, see Språkbanken's documentation. Note that the CWB dependencies are probably already met if you installed them for Kielipankki-konversio tools.

In korp_config.py, you probably have to modify at least the following variables:

CACHE_DIR (can be an empty string)

To make the Korp backend available via a web browser (TODO: at which address?), you must start Apache. Before starting Apache, run a2enmod cgi to allow CGI scripts. With the module enabled, /cgi-bin/ is mapped to /usr/lib/cgi-bin/ (TODO: this assumes that Korp is in this directory, but it could and probably should be elsewhere). Also modify Apache's configuration file (probably located at /etc/apache2/apache2.conf) so that it only accepts connections from localhost. To do this, change Require all granted to Require local for the directories /usr/share/ and /var/www/:

  <Directory /usr/share>
        AllowOverride None
        Require local
  </Directory>

  <Directory /var/www/>
        Options Indexes FollowSymLinks
        AllowOverride None
        Require local
  </Directory>
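
After editing, a quick way to see whether any Require all granted lines remain is to search the configuration file. The sketch below does only that. The function name list_open_access is hypothetical, and it does not check which Directory block a match belongs to, so review each reported line yourself.

```shell
# Print (with line numbers) every uncommented "Require all granted"
# line in an Apache configuration file.
# Returns 0 if none remain, 1 if at least one was found.
list_open_access() {
    conf=$1
    if grep -n '^[[:space:]]*Require all granted' "$conf"; then
        return 1
    fi
    return 0
}
```

It could be run as list_open_access /etc/apache2/apache2.conf, assuming the configuration file is at that location.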

Also make sure that korp.cgi has permission to write to the log directory and the log file.

Make sure that the HOME and INFO paths are correct in the file corpora/registry/CORPUSNAME after you have run korp-make:

# path to binary data files
HOME ...
# optional info file (displayed by "info;" command in CQP)
INFO ...
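
The check can be partly automated. The sketch below reads the HOME and INFO lines from a registry file and tests whether the paths exist on disk. The function name check_registry_paths is hypothetical, and it assumes that the paths are written without quotes and that INFO may be absent.

```shell
# Verify that the HOME directory and (if present) the INFO file named
# in a CWB registry file exist.
# ASSUMPTION: paths in the registry file are unquoted.
check_registry_paths() {
    registry=$1
    status=0
    # Take the value after the keyword on the first matching line.
    home=$(sed -n 's/^HOME[[:space:]][[:space:]]*//p' "$registry" | head -n 1)
    info=$(sed -n 's/^INFO[[:space:]][[:space:]]*//p' "$registry" | head -n 1)
    if [ ! -d "$home" ]; then
        echo "HOME is not a directory: $home" >&2
        status=1
    fi
    # INFO is optional, so only check it when a value was found.
    if [ -n "$info" ] && [ ! -f "$info" ]; then
        echo "INFO file not found: $info" >&2
        status=1
    fi
    return "$status"
}
```

It could be called as check_registry_paths corpora/registry/CORPUSNAME, with CORPUSNAME replaced by the name of your corpus.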

Korp frontend

Get the Korp frontend from CSC's GitHub repo, which is a fork of Språkbanken's repo. See the instructions and dependencies in Språkbanken's repo. Note that in 'Local setup for Ubuntu', the command npm install must be run as superuser, and the command sudo gem install compass must be run at the end so that compass is found.

In app/config.js, you must set the URLs of the CGI scripts (/cgi-bin/korp/ by default) and locally_available_corpora (most corpora listed here will not be available). Also modify settings.corporafolders, settings.corpora and locally_available_corpora if you wish to add corpora.

Run grunt serve and go to http://localhost:9000.

-- ErikAxelson - 2017-09-19

Topic revision: r10 - 2017-10-02 - ErikAxelson