  • BAULT Seminar No 2 at 11-12 on 28th November 2013
    Presentations by Kimmo Koskenniemi, Mikhail Kopotev and Roman Yangarber, see a separate page. Organized by Roman Yangarber.

  • Visit of Max Silberztein on 19th-23rd August 2013, including a lecture on NooJ for a wider public on Monday 19th August at 14-16 in lecture room 12 in Metsätalo, 3rd floor, and a hands-on seminar on Tuesday 20th, Wednesday 21st and Thursday 22nd at 10-12 and 14-16 in computer room 25 (Metsätalo, 5th floor). Organized by Kimmo Koskenniemi.

  • From correspondence to corpora - a full day seminar on 15 Nov 2013 sponsored by BAULT. Invited speakers were Pia Forssell (SLS), Nina Martola (KOTUS), Marijke van der Wal (Leiden University), Alison Wiggins (University of Glasgow). Organized by Terttu Nevalainen and Jan Lindström.

  • BAULT seminar No 5 on 30th January 2014. Presentations by Lauri Carlson and Arvi Hurskainen. Organized by Roman Yangarber.

  • A short BAULT seminar on 14 March 2014, with a presentation by Juraj Šimko, "Speech as a skilled, efficient communicative action" (abstract). Šimko is a postdoctoral researcher working with Martti Vainio. Organized by Roman Yangarber.

  • BAULT seminar on 6th May 2014, "Visualizing speech and teaching phonetics: from the IPA to Big Data". Organized by Martti Vainio.

  • BAULT seminar on 9 May 2014 with a presentation by Kirill Reshetnikov from the Russian Academy of Sciences ("Regional and minority languages of Russia on the Internet"). Organized by Roman Yangarber.

  • BAULT seminar on 2nd June 2014 with presentations by two visiting scholars from Japan: Xiaoyun Wang ("Phoneme Set Design for Speech Recognition of English by Japanese") and Seiichi Yamamoto ("Multimodal Corpus of Multiparty Conversations in L1 and L2 and Findings Obtained from it"). Organized by Kristiina Jokinen.

  • BAULT seminar No 10 on 5th June 2014 where projects working with Uralic language technologies and materials present themselves. Organized by Jack Rueter.

Future presentations

If you have further proposals for future presentations, please contact Kimmo Koskenniemi (kimmo.koskenniemi ät and provide a tentative title and short summary for the presentation.

Time  Presenter        Topics
--    Arto Mustajoki   (TBA, tentative:) Uses of linguistic corpora
--    Roman Yangarber  Analysis and tracking of news: leveraging across multiple LTs
--    Roman Yangarber  LT for revitalization of endangered Uralic languages
--    Mikhail Kopotev  Detection of Stable Grammatical Features in N-Grams
--    Krister Lindén   FrameNet for Finnish
--    Krister Lindén   Terminology Extraction
--    Krister Lindén   Hyperminimization
--    Anssi Yli-Jyrä   Finitary Linear Models of Phonology, Morphology and Syntax

R Yangarber: Analysis and tracking of news

The PULS project builds tools for semantic analysis of plain text, specifically for tracking on-line news media. We conduct research in Information Extraction (IE), a kind of language-understanding technology. In IE, the task is to find certain types of facts or events in text. Once the facts are collected into a database, we can perform reasoning and inference over the collected knowledge. We focus at present on three subject domains:

  • surveillance of epidemics,
  • business intelligence,
  • cross-border security and criminal activity.

A central research theme is acquisition of various kinds of domain-specific linguistic knowledge, with minimal supervision, directly from a large corpus of news:

  • We try to learn syntactic and semantic patterns which enable us to recognize names, and for each name its semantic class (person, organization, product, etc.)
  • We try to learn patterns of how interesting events are stated in text.

IE is a rich area, covering many sub-topics in language technology and linguistics: syntax, semantics, anaphora and co-reference, and discourse analysis. Wherever possible, we try to use automatic reasoning and machine learning.
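
As a rough illustration of the pipeline described above (find events in text, collect them as records, then reason over the records), here is a minimal Python sketch for the epidemic-surveillance domain. It is not the PULS system: the single hand-written regular expression, the `OutbreakEvent` record, the helper names and the sample sentences are all hypothetical, and PULS aims to learn such patterns from a corpus rather than write them by hand.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class OutbreakEvent:
    """One extracted fact: how many cases of which disease, and where."""
    disease: str
    location: str
    cases: int


# Hypothetical hand-written pattern; a real system would use many patterns,
# ideally acquired from a news corpus with minimal supervision.
PATTERN = re.compile(
    r"(?P<cases>\d+) (?:new )?cases of (?P<disease>[a-z]+) "
    r"(?:were |was )?(?:reported |confirmed )?in (?P<location>[A-Z][a-z]+)"
)


def extract_events(text):
    """Return every event in `text` that the pattern matches."""
    return [
        OutbreakEvent(m["disease"], m["location"], int(m["cases"]))
        for m in PATTERN.finditer(text)
    ]


def cases_per_disease(events):
    """A trivial 'inference' step: aggregate case counts by disease."""
    totals = defaultdict(int)
    for event in events:
        totals[event.disease] += event.cases
    return dict(totals)


# Invented sample text, for illustration only.
SAMPLE = (
    "Officials said 12 new cases of cholera were reported in Freetown. "
    "Last week 30 cases of measles were confirmed in Kabul. "
    "Another 5 cases of cholera were reported in Conakry."
)

events = extract_events(SAMPLE)
totals = cases_per_disease(events)
```

Storing events as typed records rather than raw strings is what makes the later aggregation step straightforward.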

M Kopotev: Detection of Stable Grammatical Features in N-Grams

We present work on a general-purpose system that allows the user to issue a query pattern, collects multi-word expressions (MWEs) that match the pattern, and then ranks them in a uniform fashion. This is achieved by quantifying the strength of all possible relations between the tokens and their features in the MWEs. The algorithm collects the frequencies of the morphological categories of the given pattern on a unified scale in order to find the stable categories and their values. For every part of speech, and for all of its sub-categories, we calculate a normalized Kullback-Leibler divergence between the category's distribution in the pattern and its distribution in a large corpus. Categories with the largest divergence are considered to be the most significant. The particular values of the categories are sorted according to a frequency ratio. As a result, we obtain morpho-syntactic profiles of a given pattern, which include the most stable categories of the pattern and their values.
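
The ranking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the divergence is normalized, so dividing by the log of the number of values is an assumption made here, as are the additive smoothing constant `alpha`, the function names, and the invented counts for a pattern whose noun slot strongly prefers the genitive.

```python
import math


def _smoothed_distribution(counts, values, alpha):
    """Relative frequencies over `values` with additive smoothing (assumption)."""
    total = sum(counts.get(v, 0) + alpha for v in values)
    return {v: (counts.get(v, 0) + alpha) / total for v in values}


def normalized_kl(pattern_counts, corpus_counts, alpha=0.5):
    """KL(pattern || corpus) in bits, divided by log2(number of values).

    The normalization is an assumption; it only makes categories with
    different numbers of values easier to compare.
    """
    values = sorted(set(pattern_counts) | set(corpus_counts))
    if len(values) < 2:
        return 0.0
    p = _smoothed_distribution(pattern_counts, values, alpha)
    q = _smoothed_distribution(corpus_counts, values, alpha)
    kl = sum(p[v] * math.log2(p[v] / q[v]) for v in values)
    return kl / math.log2(len(values))


def rank_values(pattern_counts, corpus_counts, alpha=0.5):
    """Sort a category's values by the ratio of pattern to corpus frequency."""
    values = sorted(set(pattern_counts) | set(corpus_counts))
    p = _smoothed_distribution(pattern_counts, values, alpha)
    q = _smoothed_distribution(corpus_counts, values, alpha)
    return sorted(values, key=lambda v: p[v] / q[v], reverse=True)


def profile(pattern, corpus):
    """Categories ordered by divergence, each with its values ordered by ratio."""
    ranked = sorted(
        pattern,
        key=lambda cat: normalized_kl(pattern[cat], corpus[cat]),
        reverse=True,
    )
    return [(cat, rank_values(pattern[cat], corpus[cat])) for cat in ranked]


# Invented counts for the noun slot of a hypothetical pattern.
PATTERN_COUNTS = {
    "number": {"sg": 60, "pl": 40},
    "case": {"gen": 95, "nom": 3, "acc": 2},
}
CORPUS_COUNTS = {
    "number": {"sg": 62, "pl": 38},
    "case": {"nom": 30, "gen": 25, "acc": 20, "dat": 8, "ins": 9, "loc": 8},
}

noun_profile = profile(PATTERN_COUNTS, CORPUS_COUNTS)
```

With these counts, case diverges far more from the corpus distribution than number does, so case is ranked as the more stable category and the genitive as its most characteristic value.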

The system has so far been tested on a Russian corpus, but we would like to explore how these ideas would be applicable more generally, to other languages.

K Lindén: FrameNet for Finnish

A parallel Finnish-English FrameNet is sorely missing. It would be useful for creating grammars for numerous language technology applications. It would also serve as training data for machine translation applications. Investigating methods for creating it semi-automatically from parallel or comparable corpora would benefit the development of LT methods for the semantic web. To be presented at a later seminar.

K Lindén: Terminology Extraction

Automatic discovery of terms remains an elusive topic. Fundamental linguistic research is needed to identify the characteristics of good terms and how to apply this knowledge to discover terms using mono- and multilingual clues. Term discovery is related to the more general problem of discovering semantic units in text. Terminology discovery and extraction is a further step to discover relations between terms. The problem has practical applications for translators and for support of high-quality human translation. To be presented at a later seminar.

K Lindén: Hyperminimization

Creating minimal descriptions of language data is useful, but despite our best efforts some descriptions still remain too large to be practical when encoded into a single data structure. Going beyond the minimal to the hyperminimal without loss of information often requires that one monolithic data structure be split into several, which can then be recombined at runtime. This problem is already encountered when encoding lexicons of polyagglutinative languages such as Greenlandic, which encode phrases as words. Similar problems are encountered when compiling a grammar of any language into a single data structure. For the foreseeable future this will remain an urgent problem for storing and efficiently applying large-scale linguistic rule sets. To be presented at a later seminar.
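
The size argument can be made concrete with a toy Python sketch. It uses plain lists and sets as stand-ins for finite-state lexicons, so it illustrates only the idea of splitting and recombining, not the method to be presented. The morpheme lists and function names are hypothetical, and the concatenations are not claimed to be linguistically accurate.

```python
from itertools import product

# Hypothetical morpheme slots; "" means the slot is left empty.
STEMS = ["talo", "auto", "pallo"]
CASES = ["", "ssa", "sta", "lla"]
POSSESSIVES = ["", "ni", "si"]
CLITICS = ["", "kin", "ko"]


def monolithic(stems, *slots):
    """Expand every combination into one big set of full word forms."""
    return {"".join(parts) for parts in product(stems, *slots)}


def accepts(word, stems, *slots):
    """Recognize a word by recombining the small components at runtime."""
    def match(rest, layers):
        if not layers:
            return rest == ""
        return any(
            rest.startswith(morph) and match(rest[len(morph):], layers[1:])
            for morph in layers[0]
        )
    return match(word, (stems,) + slots)


full = monolithic(STEMS, CASES, POSSESSIVES, CLITICS)
split_size = len(STEMS) + len(CASES) + len(POSSESSIVES) + len(CLITICS)

print(len(full))   # 108 stored forms in the monolithic version
print(split_size)  # 13 stored morphemes in the split version
```

The monolithic set grows with the product of the slot sizes, while the split representation grows with their sum. The price is the extra matching work done in `accepts` at lookup time.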

A Yli-Jyrä: Finitary Linear Models

The current theme is concerned with simple, surface-oriented grammatical theories of natural language, in contrast to theories that are often labeled deep, cognitive or generative.

Linguistic Domains. The theme recognizes such things as:

1) grammatical functions, valences and word-order constraints in syntax;
2) underlying tones, morphemes and alternations in morpho-phonology;
3) linguistic universals and empirically motivated language-specific linguistic knowledge;
4) simplicity that allows constraints, alternations and the lexicon to be learned from training data;
5) efficient wide-coverage analysis and generation of natural languages, which must be possible;
6) syntactic and morphological analyzers that are developed and evaluated empirically, even with spoken data instead of laboratory examples.

Assumptions. The theme is based on the following claims:

1) Syntax: Linear dependency/constituent structure is more characteristic of natural language than embedding dependency/constituent structure.
2) Syntax: Linear structures can sometimes overlap or be discontinuous.
3) Phonology: The composition of phonological rules yields the two-level structure, which is more important than the rules.
4) Phonology: Tone structure can be independent, but associated with the segmental structure.
5) Phonology, Morphology & Syntax: The surface of a language is a fuzzy finite-state language.
6) Phonology, Morphology & Syntax: Automatic analyzers are seen formally as finite-state relations.
7) Phonology, Morphology & Syntax: There are tighter hypotheses than finite-statedness. These hypotheses concentrate on linear units (linear phonotaxis, concatenative morphology, chunks, verb chains, context constraints) and autosegmental units (tone patterns, valences, vowel harmony patterns).

The Substance. The theme would comprise empirical, typological and mathematical approaches to language complexity. There is a lot of pertinent research at the department:

1) Syntactic complexity (Karlsson et al.)
2) Finite-state phonology, morphology and syntax (Koskenniemi, Hurskainen, Yli-Jyrä, Lindén, Huldén, Silfverberg, Pirinen)
3) Surface-oriented frameworks and grammars (Karlsson, Koskenniemi et al., Voutilainen, Mauranen, Yli-Jyrä)
4) Two-level, multilinear tonology and morphology (L Aunio, Huldén, Yli-Jyrä)
5) Star-free subclasses of finite-state languages (Yli-Jyrä)
6) Machine learning of finite-state grammars (Koskenniemi, Huldén, Silfverberg, Yli-Jyrä)

There are currently at least two academy-funded research projects associated with this theme.

