Notes on how to go about searching for relations.

Lexico-syntactic patterns

Hyponymy/Hypernymy

Hearst: Automatic Acquisition of Hyponyms from Large Text Corpora

  1. NP0 such as {NP1, NP2, ... (and|or)} NPn
    "The bow lute, such as the Bambara ndang, ..."
  2. such NP as {NP ,}* {(or|and)} NP
    "...works by such authors as Herrick, Goldsmith, and Shakespeare."
  3. NP {, NP}* {,} or other NP
    "Bruises, wounds, broken bones or other injuries..."
  4. NP {, NP}* {,} and other NP
    "...temples, treasuries, and other important civic buildings."
  5. NP {,} including {NP,}* {or|and} NP
    "All common-law countries, including Canada and England..."
  6. NP {,} especially {NP ,}* {or|and} NP
    "...most European countries, especially France, England, and Spain."

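As a rough illustration of how patterns 1, 3 and 4 could be matched on raw text, here is a minimal sketch. It assumes every NP is a single word, which is a strong simplification (multi-word phrases such as "broken bones" would need an NP chunker), and the names `hearst_pairs` and `split_list` are hypothetical helpers, not anything from Hearst's paper.

```python
import re

# Assumption: one word stands in for a whole noun phrase.
# The lookahead keeps "and"/"or" from being read as list items.
WORD = r"\b(?!(?:and|or)\b)[A-Za-z][A-Za-z-]*"
LIST = rf"{WORD}(?:, {WORD})*(?:,? (?:and|or) {WORD})?"

# Pattern 1: NP0 such as NP1, NP2, ... (and|or) NPn
SUCH_AS = re.compile(rf"(?P<hyper>{WORD}),? such as (?P<hypos>{LIST})")
# Patterns 3 and 4: NP, NP, ... (or|and) other NP
OR_OTHER = re.compile(
    rf"(?P<hypos>{WORD}(?:, {WORD})*),? (?:and|or) other (?P<hyper>{WORD})"
)

def split_list(text):
    """Split 'A, B, and C' into ['A', 'B', 'C']."""
    return [w for w in re.split(r",? (?:and|or) |, ", text) if w]

def hearst_pairs(sentence):
    """Return (hyponym, hypernym) candidates found in one sentence."""
    pairs = []
    for pattern in (SUCH_AS, OR_OTHER):
        for m in pattern.finditer(sentence):
            for hypo in split_list(m.group("hypos")):
                pairs.append((hypo, m.group("hyper")))
    return pairs

print(hearst_pairs("countries such as France, England, and Spain"))
print(hearst_pairs("bruises, wounds or other injuries"))
```

The same idea carries over to the other patterns by swapping the connecting words; the hard part in practice is the NP boundary detection that this sketch leaves out.
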
Snow, Jurafsky and Ng: Learning syntactic patterns for automatic hypernym discovery

  1. NPy like NPx:
  2. NPy called NPx:
  3. NPx is a NPy:
  4. NPx, a NPy (appositive):

Yap and Baldwin: Experiments on Pattern-based Relation Learning
Baseline:

  1. X and other Y
  2. X or other Y
  3. Y such as X
  4. such Y as X
  5. Y including X
  6. Y, especially X

Meronymy/Holonymy

Berland and Charniak: Finding Parts in Very Large Corpora

  1. whole NN[-PL]'s POS part NN[-PL]
    "...building's basement..."
  2. part NN[-PL] of PREP {the|a} DET mods [JJ|NN]* whole NN
    "...basement of a building..."
  3. part NN in PREP {the|a} DET mods [JJ|NN]* whole NN
    "...basement in a building..."
  4. parts NN-PL of PREP wholes NN-PL
    "...basements of buildings..."
  5. parts NN-PL in PREP wholes NN-PL
    "...basements in buildings..."

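The part-whole patterns above operate on part-of-speech tags rather than raw words. The following is a minimal sketch of patterns 1 to 3, assuming the input is already tagged with Penn Treebank style tags as (word, tag) tuples; the function name `part_whole_pairs` is hypothetical, and the plural patterns 4 and 5 are left out for brevity.

```python
def part_whole_pairs(tagged):
    """Return (part, whole) candidates from a list of (word, tag) tuples."""
    pairs = []
    n = len(tagged)
    for i, (word, tag) in enumerate(tagged):
        if not tag.startswith("NN") or i + 2 >= n:
            continue
        # Pattern 1: whole NN, possessive marker, part NN
        if tagged[i + 1][1] == "POS" and tagged[i + 2][1].startswith("NN"):
            pairs.append((tagged[i + 2][0], word))
        # Patterns 2 and 3: part NN, of|in, the|a, modifiers [JJ|NN]*, whole NN
        if tagged[i + 1][0] in ("of", "in") and tagged[i + 2][0] in ("the", "a"):
            j = i + 3
            while j < n and tagged[j][1] in ("JJ", "NN"):
                j += 1
            # The whole is the last noun of the modifier run.
            if j > i + 3 and tagged[j - 1][1] == "NN":
                pairs.append((word, tagged[j - 1][0]))
    return pairs

print(part_whole_pairs([("the", "DT"), ("building", "NN"), ("'s", "POS"), ("basement", "NN")]))
print(part_whole_pairs([("basement", "NN"), ("of", "IN"), ("a", "DT"), ("large", "JJ"), ("building", "NN")]))
```
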
Synonymy

Yap and Baldwin: Experiments on Pattern-based Relation Learning
Baseline:
  1. X and Y
  2. X or Y

Antonymy

Yap and Baldwin: Experiments on Pattern-based Relation Learning
Baseline:
  1. from X to Y
  2. either X or Y

Searching in Finnish

Data and tools

The Hippu.csc.fi server provides access to the Finnish text collection: http://www.csc.fi/english/research/software/ftc

The data is in XML format:

          <s>
            <w lemma="Näin" msd="" type="Adverb">Näin</w>
            <w lemma="ei" msd="Pr Act Ind S 3P" type="Verb">ei</w>
            <w lemma="pitää" msd="Nom SG Act IIpartic" type="Verb">pitänyt</w>
            <w lemma="käydä" msd="Nom SG Act Iinf" type="Verb">käydä</w>
            <w lemma="_PERIOD" msd="" type="Delimiter">.</w>
          </s>

However, in the WSOY texts, for example, paragraphs are only enclosed in a p tag.
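
The sentence markup shown above can be read with the Python standard library. This is a minimal sketch that only relies on the `s` and `w` elements and the `lemma` and `type` attributes visible in the sample; the helper names `iter_sentences` and `lemmas` are hypothetical, and files that lack `s` elements would need their own handling.

```python
import xml.etree.ElementTree as ET

# One sentence in the format shown above.
SAMPLE = """<s>
  <w lemma="Näin" msd="" type="Adverb">Näin</w>
  <w lemma="ei" msd="Pr Act Ind S 3P" type="Verb">ei</w>
  <w lemma="pitää" msd="Nom SG Act IIpartic" type="Verb">pitänyt</w>
  <w lemma="käydä" msd="Nom SG Act Iinf" type="Verb">käydä</w>
  <w lemma="_PERIOD" msd="" type="Delimiter">.</w>
</s>"""

def iter_sentences(root):
    """Yield each <s> element as a list of (surface, lemma, type) tuples."""
    for s in root.iter("s"):
        yield [(w.text, w.get("lemma"), w.get("type")) for w in s.iter("w")]

def lemmas(sentence):
    """Lemmas of one sentence, skipping delimiter tokens."""
    return [lemma for _, lemma, kind in sentence if kind != "Delimiter"]

root = ET.fromstring(SAMPLE)
sentences = list(iter_sentences(root))
print(lemmas(sentences[0]))
```

Working on lemmas rather than surface forms matters for Finnish, because the words on either side of a pattern are usually inflected.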

In addition, CSC's Scientist's Interface (Tutkijan käyttöliittymä) offers Lemmie, a corpus tool for exploring Kielipankki's Finnish and Finland-Swedish text collections.

Experiments

I tried out a few Finnish patterns. As data I used Helsingin Sanomat from 1995, specifically a folder containing 8202 XML files. The results are below:

Pattern: eli ("that is")
  Found: 1179
  Relation: mostly synonyms
  Good: 156 (approx. 13.2 %)
  Already in FWN: 13 (8.3 % / 1.1 %)
  New words: 134
  Other notes: translations of film titles, a few hypernyms

Pattern: kuten, kuten_esimerkiksi ("such as", "such as for example")
  Found: 591
  Relation: hypernymy-hyponymy
  Good: 74 (12.5 %)
  Already in FWN: 11 (14.9 %)
  New words: 64

Pattern: ja ... muu, tai ... muu ("and other", "or other")
  Found: 712
  Relation: hypernymy-hyponymy
  Good: 244 (34.3 %)
  Already in FWN: 73 (29.9 %)
  New words: 141

Pattern: sekä ... että ("both ... and")
  Found: 365
  Relation: antonymy, shared hypernym
  Good: 174 (47.7 %)
  Already in FWN: 91 (52.3 %)
  New words: 73 (incl. Finnish cities)
  Other notes:
    Antonyms are open to interpretation: henkinen/aineellinen vs. psykologinen/aineellinen.
    Countries and cities: "jonka vettä sekä Uzbekistanissa että Tadzikistanissa", "meni kaksin kappalein sekä Imatran että Tampereen".
    Also nominal antonyms: äiti/isä, ostaja/myyjä.
    Coordinate terms, although they could be antonyms: mies/nainen, kesä/talvi.

Pattern: N ja N ("N and N"; 500 files)
  Found: 760
  Relation: coordinate terms / shared (distant) hypernym
  Good: 428 (approx. 56.3 %)
  Already in FWN: 319 (74.5 % / 42 %)
  New words: 108
  Other notes: many relations are already in FWN because the same words recur: compass points 78 (18.2 %)

Pattern: N ja N (proper nouns allowed)
  Found: 391; both proper nouns: 268
  Relation: coordinate terms / shared (distant) hypernym
  Good: 112/200 (56 %)
  Already in FWN: 31 (27.7 %)
  New words: 46+1
  Other notes:
    Weather texts, so the same relations recurred: e.g. Selkämeri/Saaristomeri, Ahvenanmeri/Saaristomeri and Merenkurkku/Perämeri account for 40 of the good ones.
    Proper nouns, e.g. Finnish cities.

Pattern: Adj ja Adj ("Adj and Adj"; 2000 files)
  Found: 579
  Relation: synonymy, antonymy, coordinate terms, vague
  Good: 172 (29.7 %): synonymy (28), antonymy (56), coordinate terms (24), vague (56<)
  Already in FWN: 72 (41.9 % / 12.4 %): antonyms 29, coordinate terms 19, synonyms 12
  New words: 24
  Other notes:
    Coordinate terms when read as nouns: punainen/sininen, ranskalainen/italialainen.
    The vague ones are related to each other: vanha/perinteikäs, kylmä/vetoinen, pieni/siro.

Pattern: joko ... tai ("either ... or"; 4 words)
  Found: 114
  Relation: coordinate term 24, antonym 5, synonym 4, hyponym 1
  Good: 39 (34.2 %)
  Already in FWN: 21 (53.8 %)
  New words: 26
  Other notes: hyponym: "kaupunkien viherrakentamisessa tai joko maanviljelyssä tai puutarhanhoidossa."
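
The percentages in the results above appear to be plain ratios: good hits over found hits, and hits already in FWN over good hits (or over found hits where two figures are given). A small worked check for the eli row, with `pct` as a hypothetical helper name:

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)

found, good, in_fwn = 1179, 156, 13  # figures from the eli row

print(pct(good, found))    # share of good hits among all found
print(pct(in_fwn, good))   # share of good hits already in FWN
print(pct(in_fwn, found))  # share of all found hits already in FWN
```

This reproduces the 13.2 %, 8.3 % and 1.1 % reported for eli.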

Pattern Good examples Bad examples
eli hiilimonoksidi eli häkä
primaatit eli kädelliset
tilatusta kuolemasta eli eutanasiasta
väittää gerontologian eli vanhuustieteen professori
muita kieliä eli käytännössä englantia
10-20 tonnia muovia päivässä eli 3000-6000 tonnia vuodessa
4,5 prosenttiin eli lain määräämään..
650 miljoonaa puntaa eli 4,9 miljardia markkaa
ja vasoja 119 eli yli puolet
edellisen jakson kaltaisena eli varsin talvisena
kuten
kuten_esimerkiksi
aivan_kuten
mitenkuten
jotenkuten
havupuut, kuten kuusi ja mänty
maissa, kuten Hollannissa ja Saksassa
sählyn kuten lajia kutsutaan
koska maksuaikakorttien kuten Visan käyttö on
harrastuksia, kuten golf tai tennis
toimistot palvelevat kuten ennenkin
maanantaihin mennessä jotenkuten ehditty käydä läpi
identiteettikriisejä, kuten joulun alla,
ja ... muu
kuten ... muu
valkoherukkaa ja muita marjoja
joiden peruspäivärahat ja muut tulot
ala-aulasta luokkiin ja muihin tiloihin
tarinaa Moskovan olympialaisista ja muuta mukavaa
seurakuntamatkojen vetämisen Israeliin ja muihin kohteisiin.
sekä .... että ehtävässä sekä musta että valkea
kaikki varusmiehet palvelevat sekä talvella että kesällä
kiinnostavaa katsomista, sekä uutta että vanhaa
soittaa sekä akustista että sähkökitaraa
Poikkeusta varten sekä Suomessa että Ruotsissa
ja sateet tulevat sekä lumena että räntänä
Rakennusvalvonta kantaa huolta sekä rakennusten turvallisuudesta että terveellisyydestä.
mutta sisälsi sekä arvokkuuden että hauskuuden elementit
veropolitiikalla on sekä vähennettävä työttömyyttä että kerättävä uusia tuloja valtiolle.
N ja N
(500 files)
weather-related:
Voimistuvaa kaakon ja idän välistä tuulta
aamulla puolipilvistä ja poutaa
perjantain ja lauantain tienoilla sää poutaantuu

animals:
ovat keltasirkku, varpunen ja talitiainen, 50 yksilöä
, sinisorsa, hömötiainen ja tilhi.

foods:
Leikkaa kesäkurpitsat ja tomaatit pieniksi tasaisiksi kuutioiksi
tuotteita, mm. ankkaa ja vasikkaa.

people:
tuoda pöytään ulkomaisille liikekumppaneille ja turisteille.
kysymyksiä konsulteille, kiinteistönhoitajille ja isännöitsijöille.

other:
hinnan ja laadun suhde on hyvä
ulkomaanmatkoja oli runsaat miljoona ja tartuntoja 7000 vuodessa
jolla ruoan puhtaus ja säilyvyys taataan
Tallinnan taksien tavat ja autot ovat parantuneet huomattavasti
Adj ja Adj
(2000 files)
a few nationalities:
Pietari suomalaisena ja venäläisenä kaupunkina.

antonyms:
Lapsen kanssa - hyvinä ja pahoina päivinä
verkkoaita rajaa yksityisen ja julkisen tilan

synonyms:
käyttäminen on useimmiten selkeää ja yksinkertaista
puhdas ja kirkas laulaminen vetoavat mieleen
ovat kuitenkin varsin käteviä ja hygienisiä matkoilla ja retkillä
tuoksut ovat melko voimakkaita ja selväpiirteisiä - ne ovat..
on siellä hyvin kuivaa ja kirkasta.
muovinen mittasarja on hyvä ja edullinen hankinta.
Golem on kuitenkin kömpelö ja vaarallinen:
Japanissa bambu on perinteinen ja monikäyttöinen luonnonmateriaali
joko ... tai (4 words) olevan alunperin lähtöisin joko Venäjältä tai Virosta
kerralla komissio joko hyväksytään tai hylätään.
tekivät karkeita virheitä joko oman kyvyttömyytensä tai poliittisen johdon kiivaan hoputtelun vuoksi.
Ne ovat joko aiheita tai teoksen fyysisiä osia.
tarkoituksena on edistää mainostettavien juomien myyntiä joko kilpailijoiden kustannuksella tai kulutuksen tasoa nostamalla
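
A two-part pattern such as joko ... tai can be matched with a bounded gap. The sketch below assumes that the "(4 words)" note means at most four words between the two pattern words, which is an interpretation rather than something stated here; `windowed_matches` is a hypothetical name, and it works on surface text, so it returns inflected forms.

```python
import re

def windowed_matches(text, first="joko", second="tai", max_words=4):
    """Return (between, after) for each `first ... second` hit.

    `between` is the 1..max_words words between the pattern words,
    `after` is the single word following `second`.
    """
    gap = rf"(\w+(?: \w+){{0,{max_words - 1}}})"
    pattern = re.compile(
        rf"\b{re.escape(first)} {gap} {re.escape(second)} (\w+)"
    )
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

print(windowed_matches("olevan alunperin lähtöisin joko Venäjältä tai Virosta"))
print(windowed_matches("kerralla komissio joko hyväksytään tai hylätään."))
```

Calling it with `first="sekä", second="että"` gives the corresponding sketch for the sekä ... että pattern.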

The results are also affected by the fact that the same news stories recur. For example:

  • ulhs950123agn.xml: "...ilmoittautui kiihkoislamilainen Jihad- eli Pyhä sota-järjestö, joka vastustaa..."
  • ulhs950125agj.xml: "...islamilaisen jihadin eli pyhän sodan julistaminen koko Kaukasuksella."
  • eths950123acw.xml: "...rauhanprosessia vastustava ääri-islamilainen Jihad- eli Pyhä sota -järjestö ilmoittautui verilöylyn tekijöiksi."
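
One way to keep recurring stories from inflating the counts is to count each normalized pair once and remember which files it came from. The sketch below assumes the hits have already been reduced to lemma pairs; the lemma strings in the sample are normalized by hand for illustration, and `group_by_pair` is a hypothetical helper.

```python
from collections import defaultdict

def group_by_pair(hits):
    """Map each lowercased (left, right) pair to the set of files it occurs in.

    `hits` is an iterable of (filename, left_lemma, right_lemma).
    """
    files_by_pair = defaultdict(set)
    for filename, left, right in hits:
        files_by_pair[(left.lower(), right.lower())].add(filename)
    return files_by_pair

# The three eli hits listed above, with hand-normalized lemmas (assumption).
hits = [
    ("ulhs950123agn.xml", "Jihad", "pyhä sota"),
    ("ulhs950125agj.xml", "jihad", "pyhä sota"),
    ("eths950123acw.xml", "Jihad", "pyhä sota"),
]
groups = group_by_pair(hits)
print(len(hits), "hits,", len(groups), "distinct pair")
```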

Context

Hyponym/Hypernym
englanti/kieli, omena/hedelmä, kana/liha, lounas/ateria, kissa/kotieläin, jääkaappi/kodinkone, eläin/nisäkäs/koira, tomaatti/vihannes, flunssa/tauti, musta/väri, mansikka/marja, maa/Suomi, auto/kulkuneuvo, valtio/Venäjä, tuuli/sää, yhteentörmäys/tapaturma, tietokone/laitteisto, sohva/huonekalu, hiiri/jyrsijä, lohi/kala

272 sentences (10/20 synsets), of which 215 are maa/Suomi:
- Matkustaja saa tuoda verotta Suomeen EU-maista verollisina ostamiaan tuotteita lahjaksi tai ..
- Toisesta EU-maasta verollisena ostetusta autosta ei tarvitse kantaa Suomen rajalla veroja.
- muutamissa maissa, kuten Suomessa
- useat suuret valtiot, kuten Yhdysvallat, Venäjä ja Ranska
- uskoo vakaasti, ettei Suomi kuten mikään muukaan maa pärjää..

Synonym

suuri/iso/valtava, arvonlisävero/alv, keskusta/keskus/keskipiste, tavanomainen/arkinen, influenssa/flunssa/nuhakuume, onnettomuus/tapaturma/vahinko, eläin/elukka/eläimistö, pieni/vähäinen/pikku/vaatimaton, kaunis/ihana, laidun/keto/niitty/laidunmaa, tuote/teos, järjestö/johtoelin/organisaatio/laitos/hallinto/johto, ihminen/henkilö, olento/eliö, rooli/hahmo/persoona/osa, puute/tarve, aika/ajanjakso/kausi, penger/piennar/valli, ajatus/idea, aiheuttaja/syy, maalaus/taulu

68 sentences (10/21 synsets); suuri 20, pieni 13, järjestö 10, rooli 11:
- Hän painottaa ettei tarkoita mezzojen olevan osistaan jotenkin katkeria, vaan roolit ovat muotoutuneet jo vuosisatoja sitten...
- Pablo Picasson maalaus Nainen ja sininen kaulus (1941) oli yksi Moderna museetista varastetuista tauluista.
- Aikaa on kulunut näinkin pienen ja vaatimattoman asian hiomiseksi, Leino kertoo.
- Mielipiteiden ilmaisemiseen saattaa liittyä ajatuksia tai kommentteja yleisistä ideoista tai huomautuksia tositapahtumia koskevista uutisista.
- Kesän ensimmäinen suuri rockjuhla on lyhyen tauon jälkeen vanhoissa mitoissaan isoa päälavaa myöten.
- Kansallisarkisto ja seitsemän muuta laitosta kerää koko vuoden talteen välirauhansopimuksen jälkeen lakkautettujen järjestöjen papereita ja esineitä.

Implementation

The script to be implemented (possibly in Python) could work as follows:

  1. Select X example pairs for the desired relation
  2. Find these in the chosen data and store their context
  3. Look for similarities (how?) and select suitable patterns
  4. Use the selected patterns to find new words
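
The four steps can be sketched as follows. This is only one possible reading: it assumes sentences are lists of lemmas, takes the "context" to be the tokens between the two words of a pair, and answers the "how?" of step 3 with a simple frequency threshold. All function names and parameters are hypothetical.

```python
from collections import Counter

def contexts(sentences, seeds, max_gap=3):
    """Step 2: count the token sequences found between seed words."""
    found = Counter()
    for tokens in sentences:
        for x, y in seeds:
            if x in tokens and y in tokens:
                i, j = tokens.index(x), tokens.index(y)
                if 0 < j - i - 1 <= max_gap:
                    found[tuple(tokens[i + 1:j])] += 1
    return found

def select_patterns(found, min_count=2):
    """Step 3: keep connectors seen at least `min_count` times."""
    return [ctx for ctx, n in found.items() if n >= min_count]

def apply_patterns(sentences, patterns):
    """Step 4: propose (X, Y) wherever X <connector> Y occurs."""
    pairs = set()
    for tokens in sentences:
        for ctx in patterns:
            k = len(ctx)
            for i in range(1, len(tokens) - k):
                if tuple(tokens[i:i + k]) == ctx:
                    pairs.add((tokens[i - 1], tokens[i + k]))
    return pairs

# Step 1: seed pairs; sentences are lemmatized by hand for illustration.
seeds = [("hiilimonoksidi", "häkä"), ("primaatti", "kädellinen")]
sentences = [
    ["hiilimonoksidi", "eli", "häkä"],
    ["primaatti", "eli", "kädellinen"],
    ["gerontologia", "eli", "vanhuustiede"],
]
patterns = select_patterns(contexts(sentences, seeds))
new_pairs = apply_patterns(sentences, patterns) - set(seeds)
print(patterns, new_pairs)
```

Taking only the immediately adjacent words as X and Y is the crudest possible generalization; with parsed data the heads of the neighbouring phrases would be the natural choice.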

Questions: Which data? Which relation? How should the patterns be generalized?

Wikipedia data

8500 pages (page tag), of which 396 documents (about 4.7 %) were redirects

CSC has two parsing tools: Kielikone's textmorfo and Connexor's fi-fdg. The former does not preserve the word order of the text in its output, so I planned to use the latter. The uninflected patterns are searched for first and the hits are parsed as needed:

  1. eli: 1542 sentences
  2. ja: 35,502
  3. kuten: 1136
  4. on/ovat: 27,305
  5. tai: 3065

Pattern: eli
  Found: 1405 (HS: 1179)
  Relation: synonymy 468, hyponymy 21, explanation 47, translation 49
  Good: 586/1100 (53.3 %; HS: 13.2 %)
  Already in FWN: 66 (11.3 %; HS: 13, 8.3 %)
  New words: 634+12; explanation +34; 45 translation relations containing +88 words (HS: 134)
  Other notes: the translation relation also covers abbreviations, e.g. JAA eli Joint Aviation Authorities

Pattern: kuten, kuten esimerkiksi
  Found: 1151 (HS: 591)
  Relation: hypernymy-hyponymy
  Good: 242/591 (40.9 %; HS: 12.5 %)
  Already in FWN: 41 (16.9 %; HS: 14.9 %)
  New words: 207+4; additionally 138 (HS: 64)
  Other notes:
    The whole sentence was examined; many proper nouns, incl. bands and people.
    Names of films, series and songs were left out: 31 cases.

Pattern: ja muu, tai muu
  Found: 745 (HS: 712)
  Relation: hypernymy-hyponymy
  Good: 263/745 (35.3 %; HS: 34.3 %)
  Already in FWN: 69 (26.2 %; HS: 29.9 %)
  New words: 151+3; additionally +11 (HS: 141)
  Other notes: some definitions of proper nouns.

Pattern: sekä...että (4 words)
  Found: 400 (HS: 365)
  Relation: shared hypernym, antonymy
  Good: 270/400 (67.5 %; HS: 47.7 %)
  Already in FWN: 115 (42.6 %; HS: 52.3 %)
  New words: 213+8; additionally +6 (HS: 73)
  Other notes: some proper nouns with a shared hypernym (e.g. people)

Pattern: N ja N
  Found: 10,732 (HS: 760, 500 files); all: approx. 11,600
  Relation: shared (distant) hypernym
  Good: 436/760 (57.4 %; HS: 56.3 %)
  Already in FWN: 253 (58 %; HS: 74.5 %)
  New words: 217+8 (HS: 108)
  Other notes:
    For some pairs the shared hypernym is fairly distant.
    New sense for a word, e.g. disko as music, englanti as a school subject...

Pattern: N ja N (proper nouns allowed)
  Found: 8640; both proper nouns: 5952 (HS: 391, 500 files); all: 19,385
  Relation: shared (distant) hypernym
  Good: 99/200 (49.5 %; HS: 56 %)
  Already in FWN: 18 (18.2 %; HS: 27.7 %)
  New words: 117+2 (HS: 46)
  Other notes: proper nouns, including fictional characters, e.g. from operas

Pattern: A ja A
  Found: 1365 (HS: 579, 2000 files); all: approx. 1600
  Relation: synonymy 32, antonymy 42, shared hypernym 58
  Good: 136/580 (23.4 %; HS: 29.7 %)
  Already in FWN: 96/136 (70.6 %; HS: 41.9 %)
  New words: 14+9=23 (HS: 24)
  Other notes:
    Shared hypernym: nationalities and colours (does not work for adjectives, does for nouns). Of the pairs found in FWN, 45 were looked up as nouns.
    9 words were found as adjectives but not as nouns: vaaleansininen, neuvostoliittolainen
    Problematic: kuiva/puolikuiva, emäksinen/ultraemäksinen

Pattern: joko...tai (4 words)
  Found: 145 (HS: 114)
  Relation: coordinate terms, shared (distant) hypernym, antonyms
  Good: 75/145 (51.7 %; HS: 34.2 %)
  Already in FWN: 41/75 (54.7 %; HS: 53.8 %)
  New words: 53 (HS: 26)
  Other notes:
    Proper nouns and numbers.
    The new words are not necessarily good: harjoittelukoulu, elektroninen lehti

Pattern: NP(nom)...on/ovat...NP
  Found: 17,124 sentences; HS: 434 sentences containing "olla", of which not as an auxiliary: 119 (50 files)
  Relation: hypernymy-hyponymy 56, meronymy 9; 223 sentences, 250 cases
  Good: 72/250 (28.8 %)
  Already in FWN: 15/72 (20.8 %)
  New words: 58

Pattern: first sentence
  Relation: hyp.
  Good: 93/100 (93 %)
  Already in FWN: 2/93 (2.2 %)
  New words: 107; additionally +20; explanations 7
  Other notes: proper nouns, which have several definitions. The first sentence also yields a few synonyms (20).

-- PaulaPaakko - 2010-11-15

Topic revision: r61 - 2011-05-09 - PaulaPaakko