Korp concordance analysis in R -- a plan

The plan is to bring concordances from the Finnish Korp into R in a form that is as easy to manipulate as possible, without losing information. Ease aside, a concordance in CSV (Comma Separated Values) format is already readable into R.

High level view of the plan

The trick is to help manipulate a concordance as what it is: metadata and a number of annotated sentences with a specific structure. The structure of a sentence is threefold:

  • it's a sequence of tokens
  • it's partitioned in three: context before, key tokens, context after
  • it's a dependency tree (if annotated to that level)

And each token is structured as a record of named fields.

So there is structure within overlapping structure within structure, which is not native to a simple CSV table. All this can be wrapped in a data structure in R. (And keep track of metadata.)
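
As a rough illustration of wrapping that structure in R, a sentence could be a nested list. This is only a sketch: the field names (word, lemma, pos, head, rel), the metadata names and the example sentence are invented here, not taken from Korp.

```r
# Hypothetical token record; the field names are assumptions for illustration.
token <- function(word, lemma, pos, head, rel) {
  list(word = word, lemma = lemma, pos = pos, head = head, rel = rel)
}

# One sentence: metadata plus the three-way partition.
# `head` is the 1-based position of the parent token, 0 for the root.
sentence <- list(
  meta   = list(corpus = "EXAMPLE", sentence_id = 1L),
  before = list(token("pieni", "pieni", "ADJ", 2L, "amod")),
  keys   = list(token("kissa", "kissa", "NOUN", 3L, "nsubj")),
  after  = list(token("nukkuu", "nukkua", "VERB", 0L, "root"))
)

# The same sentence viewed as a plain sequence of tokens.
all_tokens <- c(sentence$before, sentence$keys, sentence$after)

# Named fields can be read across the whole sequence.
words <- vapply(all_tokens, function(tok) tok$word, character(1))
heads <- vapply(all_tokens, function(tok) tok$head, integer(1))
```

The partition, the token sequence and the dependency links (through head) all live in the same object, which is the overlap that a flat CSV table does not express directly.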

One other thing! There's also a desire to "read" such concordances into R straight from Korp, by sending a query and receiving a concordance in return. (Does one write raw CQP for this? With a hand-written list of corpora? Does one authenticate oneself to be granted the authorization to access restricted corpora? In plain text? Never in plain text.)

Middle level view of the plan

Input consists of metadata of the query followed by individual sentences, each given as its metadata followed by its annotated tokens, where the key tokens are marked. This is to be transformed into a table of sentences, each consisting (abstractly) of

  • metadata (combining concordance metadata with sentence metadata),
  • context before,
  • key tokens,
  • context after.

These tables of sentences can then be manipulated as objects in their own right, with easy access to the relevant structure, or individual sentences can be reified as records that also provide access to their relevant structure.
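
The transformation of one input sentence into such a record might look roughly like this. It is a sketch under assumptions: the tokens arrive as a list of records with a logical key field marking the key tokens, and the names split_sentence, sentence_meta and concordance_meta are made up for the example.

```r
# Split a marked token sequence into (meta, before, keys, after).
# Assumes every token record has a logical `key` field and that the key
# tokens form one contiguous run.
split_sentence <- function(tokens, sentence_meta, concordance_meta) {
  is_key  <- vapply(tokens, function(tok) isTRUE(tok$key), logical(1))
  key_pos <- which(is_key)
  stopifnot(length(key_pos) > 0)
  first <- min(key_pos)
  last  <- max(key_pos)
  list(
    meta   = c(concordance_meta, sentence_meta),  # combined metadata
    before = tokens[seq_len(first - 1L)],
    keys   = tokens[first:last],
    after  = tokens[seq_len(length(tokens) - last) + last]
  )
}

tokens <- list(
  list(word = "a", key = FALSE),
  list(word = "b", key = TRUE),
  list(word = "c", key = TRUE),
  list(word = "d", key = FALSE)
)
s <- split_sentence(tokens,
                    sentence_meta    = list(sentence_id = 7L),
                    concordance_meta = list(query = "example"))
```

A table of sentences is then simply a list of such records, one per sentence.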

Be guided by database theory (C. J. Date and Hugh Darwen, not SQL) to provide the following operations on such tables:

  • union (require same heading; maybe disjoint union? a proper union would be better)
  • extension (add new named columns of values, given a function from a sentence record to a number of labeled values)
  • selection (given a predicate on sentence records)
  • projection (keep or drop some named columns)

Possibly allow the renaming of columns in each such operation where it makes sense or seems useful. The operations will be carried out literally; nothing will optimize the execution. (Should these be carried out in-place? Rather not. Not sure of performance implications of anything yet.)
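
To make the four operations concrete, here is a minimal sketch over a table represented as a plain list of records. The function names (tbl_select and so on) are hypothetical, the records are simplified to character vectors instead of full token records, and the union shown is a proper union that drops exact duplicates.

```r
# A "table" here is just a list of sentence records (named lists).

# selection: keep the records for which the predicate holds
tbl_select <- function(tbl, pred) Filter(pred, tbl)

# extension: f maps a record to a named list of new values
tbl_extend <- function(tbl, f) lapply(tbl, function(rec) c(rec, f(rec)))

# projection: keep only the named columns
tbl_project <- function(tbl, keep) lapply(tbl, function(rec) rec[keep])

# union: require the same heading, then drop exact duplicates
tbl_union <- function(a, b) {
  heading <- function(tbl) unique(lapply(tbl, function(rec) sort(names(rec))))
  stopifnot(length(heading(c(a, b))) <= 1L)
  unique(c(a, b))
}

tbl <- list(
  list(id = 1L, keys = c("kissa"), after = c("nukkuu")),
  list(id = 2L, keys = c("koira", "haukkuu"), after = character(0))
)

longer  <- tbl_select(tbl, function(rec) length(rec$keys) > 1)
counted <- tbl_extend(tbl, function(rec) list(n_keys = length(rec$keys)))
ids     <- tbl_project(counted, c("id", "n_keys"))
both    <- tbl_union(tbl, longer)
```

Every operation returns a new list and leaves its argument untouched, which matches the inclination not to work in-place. Nothing here is optimized.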

Columns are accessed by their name. (Unofficially R will probably allow other forms of access.) There will be missing values, contrary to the guidance from database theory. Let R deal with those, for better or worse.

Each of the four columns (meta, before, keys, after) and the whole sentence (before, keys, after) can be accessed for its structurally relevant content by various functions that are to be specified.

  • access tokens before and after relative to keys
  • access each token by its position in the sentence, or in the segment
  • access root, parent, children
  • access all tokens in a window as a group, maybe

It looks like individual tokens need to be reified when requested. They will at last have atomic fields.
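
Some of these accessors could be sketched as follows. The representation is an assumption made for the example: a sentence keeps its tokens in one flat list, records the key span as key_first and key_last, and each token's head field gives the 1-based position of its parent (0 for the root). All function names are hypothetical.

```r
sent <- list(
  tokens = list(
    list(word = "pieni",  head = 2L),
    list(word = "kissa",  head = 3L),
    list(word = "nukkuu", head = 0L)
  ),
  key_first = 2L,
  key_last  = 2L
)

heads <- function(s) vapply(s$tokens, function(tok) tok$head, integer(1))

# access by position in the sentence
token_at <- function(s, i) s$tokens[[i]]

# dependency tree navigation, all in terms of positions
root_pos    <- function(s) which(heads(s) == 0L)
parent_of   <- function(s, i) heads(s)[i]
children_of <- function(s, i) which(heads(s) == i)

# access relative to the keys: -1 is the token just before the first key,
# +1 the token just after the last key; NULL when outside the sentence
relative_to_keys <- function(s, offset) {
  stopifnot(offset != 0)
  i <- if (offset < 0) s$key_first + offset else s$key_last + offset
  if (i < 1 || i > length(s$tokens)) NULL else s$tokens[[i]]
}
```

Here token_at and relative_to_keys are the places where a token is reified as a record, while the tree functions stay with positions until a token is actually asked for.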

Each table will keep track of the history of its construction in each sentence's meta.
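
One possible way to keep that history, assuming it is simply a character vector stored under meta$history (the field name and the step descriptions are invented for this sketch):

```r
# Append a description of a construction step to a sentence record's meta.
with_history <- function(rec, step) {
  rec$meta$history <- c(rec$meta$history, step)
  rec
}

rec <- list(meta = list(query = "example"), keys = "kissa")
rec <- with_history(rec, "read from Korp")
rec <- with_history(rec, "selection: one key token")
```

Each table operation would call something like this on the records it produces, so the history travels with the sentence through unions and selections.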

Low level view of the plan

R has "data frames" that enforce the tabular form of homogeneous named columns of the same length, but data frames are not a natural fit for structured content. They are meant for numbers, truth values, strings, and "factor levels", with NAs for missing values. (A column can technically be a list, but that is not the ordinary use.)

R has "lists" that contain a number of named values that can be, well, whatever. A data frame is then just a list with suitable content and a "data.frame" as a kind of tag. To allow for the structured content of the sentence records, use the list. A projection may be made into a data frame. Not sure if there is a reason to make it a data frame, but that is certainly possible and not too foreign to a user of R.
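
A small sketch of both points: that a data frame is a tagged list, and that a projection of list records onto atomic columns can be turned into a data frame. The record layout and the function as_flat_frame are assumptions for the example.

```r
# A data frame is a list with a class attribute.
df <- data.frame(id = 1:2, word = c("kissa", "koira"),
                 stringsAsFactors = FALSE)
is.list(df)   # TRUE
class(df)     # "data.frame"

# Sentence records with structured content (a list of token records).
records <- list(
  list(id = 1L, keys = list(list(word = "kissa", pos = "NOUN"))),
  list(id = 2L, keys = list(list(word = "koira", pos = "NOUN"),
                            list(word = "haukkuu", pos = "VERB")))
)

# Project onto atomic columns only, then build a data frame from them.
as_flat_frame <- function(records) {
  data.frame(
    id       = vapply(records, function(r) r$id, integer(1)),
    n_keys   = vapply(records, function(r) length(r$keys), integer(1)),
    key_text = vapply(records, function(r) {
      paste(vapply(r$keys, function(tok) tok$word, character(1)),
            collapse = " ")
    }, character(1)),
    stringsAsFactors = FALSE
  )
}

flat <- as_flat_frame(records)
```

The structured records stay in the list; the data frame is a derived, flattened view for the usual R tools.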

Now, it might be good to use some sort of "object oriented" composition for all this, to pack all the restrictions and possibilities as "classes". This needs to be studied.
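
As one option to study, R's S3 system would allow something like the following. This is a sketch only, not a decision: the class name "concordance" and the constructor new_concordance are hypothetical.

```r
# Constructor: a concordance is a list of sentence records plus metadata,
# tagged with a class attribute.
new_concordance <- function(sentences, meta = list()) {
  stopifnot(is.list(sentences), is.list(meta))
  structure(list(meta = meta, sentences = sentences), class = "concordance")
}

# Methods for existing generics are found through the class tag.
length.concordance <- function(x) length(x$sentences)

print.concordance <- function(x, ...) {
  cat("<concordance:", length(x), "sentences>\n")
  invisible(x)
}

# Subsetting keeps the class and the metadata.
`[.concordance` <- function(x, i) {
  new_concordance(x$sentences[i], x$meta)
}

conc <- new_concordance(
  list(list(id = 1L), list(id = 2L), list(id = 3L)),
  meta = list(query = "example")
)
part <- conc[2:3]
```

S3 is the lightest of R's object systems; S4 or reference classes would enforce more, at the cost of more ceremony.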

-- JussiPiitulainen - 2015-02-04

Topic revision: r1 - 2015-02-04 - JussiPiitulainen