-- JussiPiitulainen - 2010-12-14

Making the Language Bank of Finland meet its DTD

This document concerns those corpora in the Language Bank of Finland that appear to be meant to conform to the DTD called sktpxml.dtd, now found in /kielipankki/dtd/. These appear to be the corpora in the Finnish and Finland Swedish text collections, ftc and fstc, also known as suomen kielen tekstipankki, sktp. Sometimes the names are capitalised.

This document describes the process by which the files in these corpora have been made to actually conform to their DTD.


SGML -> Roope Havu in 2001 -> XML

Now me.


The most acute issues concern the fact that all the files under scrutiny are broken as XML.

  • not standalone due to two entities of dubious significance
  • and the DTD is not there (I found it buried elsewhere, and had it installed in /kielipankki/dtd/, but the files still seek it where it is not)
  • so XML tools not usable
  • and problems persist even when the documents are made to point to the DTD
  • and even when corrected, the dependence on the DTD is of dubious value, quite possibly negative
  • also some semantic failures in some of the corpora

A long-term issue is that the DTD is private.

  • the DTD is neither TEI nor XCES, so even after the above issues are adequately addressed, we are still not entirely comfortable
  • we could probably go XCES
  • but Roope has a point about the morphological annotations, so such documents may be better as they are

Character encoding might as well be brought up to date. (Me, attitude?)

  • very minor: the files are in ISO-8859-1 aka Latin-1, not in UTF-8; this is all right in so far as the character encoding is declared, but the world should move over to UTF-8 already now, and UTF-8 is native to XML anyway, so the maintenance of the corpus will be easier if we let the tools save it in their preferred UTF-8

Stages 1 and 2 of the cleanup

In Stage 1, copy the 2009 version of each corpus into a 2010 version, going UTF-8 at this point. (Just iconv, so the declaration of the encoding in the XML declaration becomes invalid.) Also, edit extra documents like backup files in *~ and such.

In Stage 2, correct the doctype declaration in each 2010 XML document to be SYSTEM /kielipankki/dtd/sktpxml.dtd, so that those documents without genuine problems become valid. Also, drop the declaration of ISO-8859-1 from the XML declaration.
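The two textual changes of Stages 1 and 2 can be sketched in Python roughly as follows. This is only a minimal sketch, not the procedure actually used (the text above mentions iconv for the transcoding). The function name `modernise` and the regular expressions are assumptions made for illustration; the sketch relies on the XML declaration and the doctype declaration having the shapes reported in the findings below.

```python
import re

OLD_ENCODING = "iso-8859-1"
NEW_DTD = "/kielipankki/dtd/sktpxml.dtd"


def modernise(raw):
    """Return a UTF-8 version of one Latin-1 document (hypothetical helper).

    raw is the content of the file as bytes.
    """
    # Stage 1: reinterpret the Latin-1 bytes as text.
    text = raw.decode(OLD_ENCODING)
    # Stage 2: drop the encoding pseudo-attribute from the XML declaration,
    # since UTF-8 is the default encoding of XML anyway.
    text = re.sub(r'^(<\?xml[^>]*?)\s+encoding="[^"]*"', r"\1", text, count=1)
    # Stage 2: point the doctype declaration at the installed DTD.
    text = re.sub(
        r'(<!DOCTYPE\s+TEI\.2\s+SYSTEM\s+)"[^"]*"',
        rf'\1"{NEW_DTD}"',
        text,
        count=1,
    )
    return text.encode("utf-8")
```

Applied to a file, this would be used as `modernise(open(path, "rb").read())`, writing the result to the 2010 copy and never to the 2009 original.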

At this point, we are well on our way to having a 2010 version of the corpus, but the documents in question are still in a preliminary state.

(2010-12-29) Now, in Stage 1, I have made a copy of all fstc and ftc. I have also made a database using SQLite3 to find out what the document types are meant to be and what they are. The following are the results.

  • there are 617362 documents named *.xml
  • every one of them has an XML declaration, on a line of its own, one of the following two.
    • <?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
    • <?xml version="1.0" encoding="iso-8859-1"?>
  • every one of them has a doctype declaration, on a line of its own, always the following one.
    • <!DOCTYPE TEI.2 SYSTEM "/usr/lib/sgml/dtd/sktpxml.dtd">
  • every one of them has the document element start tag on the third line, always the following one.
    • <TEI.2>
  • 432518 documents are malformed
  • only 25 documents are invalid, and these are exactly the documents in the corpora soderstroms-a (4 documents) and soderstroms-b (21 documents)
  • the invalid documents are well-formed
  • the only cause of invalidity seems to be the attribute syn of element w (so these are morphologically annotated documents, but what gives?) (There are a lot of these. It is still just one error.)
  • the invalid documents are exactly those that are not declared not standalone! This must be the reason somehow. Or must it? Perhaps the two properties have merely a common origin.
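A survey like the one above can be sketched with the sqlite3 module of Python. The table layout and the function names here are assumptions for illustration, not the schema of the actual database. The well-formedness test uses the expat parser of the standard library, which may not agree in every detail with whatever tool produced the counts above, in particular for references to entities that are declared only in the DTD.

```python
import sqlite3
import xml.parsers.expat


def first_lines(raw, n=3):
    """Return the first n lines of a document, padded with empty strings."""
    lines = raw.decode("iso-8859-1").splitlines()[:n]
    return lines + [""] * (n - len(lines))


def is_well_formed(raw):
    """Check well-formedness with expat, without loading any DTD."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(raw, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


def survey(db, docs):
    """Record, for each (path, bytes) pair, what the document claims to be."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS doc ("
        "path TEXT PRIMARY KEY, xmldecl TEXT, doctype TEXT, "
        "root TEXT, wellformed INTEGER)"
    )
    for path, raw in docs:
        xmldecl, doctype, root = first_lines(raw)
        db.execute(
            "INSERT INTO doc VALUES (?, ?, ?, ?, ?)",
            (path, xmldecl, doctype, root, int(is_well_formed(raw))),
        )
    db.commit()
```

A query such as `SELECT xmldecl, COUNT(*) FROM doc GROUP BY xmldecl` then lists the distinct XML declarations with their frequencies.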

Garbage removal activities in fstc and ftc:

  • removed 47 *.xml~ in FNB left from hand-editing; all were duplicates for *.xml
  • removed 10839 *.xm in hufvudstadsbladet; all were duplicates for *.xml
  • removed 1 *.xml~ in hufvudstadsbladet; was a duplicate with nulls at end
  • removed 5 *.xml~ in aamulehti; all were duplicates for *.xml, either not annotated or with trailing nulls
  • removed 3 *.xml~ in demari; all were duplicates for *.xml with one differing in indentation, one having trailing nulls, and one having spurious empty closer elements
  • removed tmp.tmp in hameensanomat; it was an unannotated version of /367653.xml
  • removed 360507.xml~ in hameensanomat; it was a duplicate for *.xml with two lines of garbage in the middle
  • removed 1 *.xml~ in iltalehti; it was an unannotated duplicate for *.xml
  • removed 1 *.xml~ in kaleva; it was a duplicate for *.xml with garbage in the middle
  • in kangasalansanomat:
    • removed nohup.out which was not exactly for humans anyway
    • removed etusivunhäkämies.x as exact duplicate for etusivunhäkämies.xml
    • renamed etusivunhäkämies.xml as etusivunhakamies.xml (sad, that; the ä's were whatever, one octal 204 byte)
    • removed koulurauha_kuva as exact duplicate for koulurauha_kuva.xml
    • renamed nettpäivä.xml (sic) as nettipaiva.xml (sad, that; the ä's were octal 204)
    • renamed rakennusjärj.xml as rakennusjarj.xml (the ä was octal 204)
    • renamed kyläohj.xml as kylaohj.xml (204)
    • removed myyjäisvinkki and renamed myyjäisvinkki.xml as myyjaisvinkki.xml (204)
    • renamed sauvakäv.xml as sauvakav.xml (204)
    • removed virolavinkki. as exact duplicate for virolavinkki.xml
  • worry about those octal 204's that still occur inside the document metadata: such are not 8859-1, so strictly speaking the XML declarations are fail; octal 204, decimal 132 seems to map to ä in some encoding called IBM-850, whatever nightmare of the past that may now be
  • in karjalainen, SOTKA_RA.untagged was an untagged version of an alueSOTKA_RA.xml in another directory; removed it
  • also in karjalainen, removed a nohup.out which again was not exactly designed for a human to read anyway
  • also in karjalainen, removed a teiTextMorfo.tmp which contained just one short line of apparent corpus text for no apparent reason
  • in keskisuomalainen, removed nine Perl scripts which have no apparent reason to lie where they lay, and a teiHeader and teiTemplate with them, please
  • in turunsanomat, removed a tmp which turned out to be an unannotated version of an EI_LAPSI.xml, and removed a parseErrors.txt which was empty (recreate with touch if you really want it there but if you do please document your intentions in detail)
  • also in turunsanomat, 34 *.xml~; all appear to be duplicates for *.xml; most contain just an extra whitespace-only s element; VALINNOI.xml~ ends with four nulls instead, and PV.xml~ has two revisionDesc elements where apparently one should be; removed all 34
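The remark above about octal 204 can be checked in Python, whose cp850 codec corresponds to IBM code page 850. The sketch below is only illustrative: the helper names are made up, and scanning for bytes in the range 0x80 to 0x9F is an assumed heuristic for spotting text that is probably not ISO-8859-1, since Latin-1 maps that range to control characters.

```python
SUSPECT = bytes([0o204])  # octal 204 == decimal 132 == hexadecimal 84


def reinterpret(name, encoding="cp850"):
    """Decode a byte string under the guessed legacy encoding."""
    return name.decode(encoding)


def c1_offsets(raw):
    """Offsets of bytes that ISO-8859-1 would map to C1 control characters."""
    return [i for i, byte in enumerate(raw) if 0x80 <= byte <= 0x9F]
```

For example, `reinterpret(b"kyl\x84ohj.xml")` gives the file name with a proper ä, and `c1_offsets` applied to the bytes of a document shows where such suspicious bytes still occur in the metadata.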

That should have taken care of all backup files (those *.xml~) and accidents in file names.

There remain 7 files named indentall.log and 16 named fileStructure.lst or fileStructure.lst2 that do not seem very useful.

  • in aamulehti, one indentall.log tells that 75510 files were found; there are 75510 files in *.xml below aamulehti/2010; removed the indentall.log
  • in demari, one indentall.log tells that 23287 files were found; there are 23280 files in *.xml below demari/2010; where are 7? some directory names are funny; removed the indentall.log
  • in hameensanomat, one indentall.log tells that 5182 files were found; there are 5182 files in *.xml under hameensanomat/2010; removed the indentall.log
  • in hyvinkaansanomat, one indentall.log tells that 7463 files were found; there are 7463 files in *.xml under hyvinkaansanomat/2010; removed the indentall.log
  • in kaleva, one indentall.log tells that 38066 files were found; there are 38066 files in *.xml under kaleva/2010; removed the indentall.log
  • in kangasalansanomat, one indentall.log tells that 1357 files were found; there are 1354 files in *.xml under kangasalansanomat/2010; where are 3? this was the corpus with octal 204 ä and other brokenness in file names; removed the indentall.log
  • in karjalainen, one indentall.log tells that 116047 files were found; there are 116046 files in *.xml under karjalainen/2010; where is 1? could it be the one untagged duplicate that I removed? removed the indentall.log

Now maybe a finishing round of such renamings and removals. I go back to fstc and hufvudstadsbladet, 2010, to rename the twelve strangely named directories beginning with 1999/dump1999 and ending with 1999/dump2010, but first I learn that something else is funny: all twelve directories from 1998/01 to 1998/12 contain XML files in a pattern like 1998/03/19980301.714.xml, and the 714 in the example is a running number from 1 to the number of files except the counter is almost always a few more than the number of files! This does not seem right. (The exception is 1998/08 with 1568 files and highest counter 1568.) Here's how I see the discrepancy:

  $ ls 1998/03 | wc -l
  $ ls 1998/03 | sort -t . -k 2 -n | tail -n 2
I have not (yet?) thought of finding out which counter values are missing, because I do not know what to do about it anyway.
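Should the missing counter values ever be wanted, they could be found with something like the following sketch. It assumes that all names follow the pattern yyyymmdd.COUNTER.xml seen in the example above; the function name is made up.

```python
import re

# yyyymmdd.COUNTER.xml, as in 19980301.714.xml
COUNTER = re.compile(r"^\d{8}\.(\d+)\.xml$")


def missing_counters(names):
    """Return the counter values between 1 and the largest seen that no file has."""
    seen = {int(m.group(1)) for m in map(COUNTER.match, names) if m}
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

It would be called as `missing_counters(os.listdir("1998/03"))`, after importing os.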

So. Back to hufvudstadsbladet and 1999/dump1999 to 1999/dump2010. The documents have names like dump1999.714.xml, with the directory misname and a counter. In the metadata inside each document, the publication date is given merely as 1999 (except the last file in dump2010 has it as 2000, go figure) so there is no month there.

Based on the following pattern of the most frequently named months in the data itself (the two most named months in each directory in upper case, and three more in lower case), I find it reasonable to guess that the directory names map to the months of the year in the obvious order: dump1999 should have been 01, dump2000 should have been 02, and so on until dump2010 should have been 12.

/key                           123456789012
1999/dump1999        januari   Jfm.......nD
1999/dump2000        februari  jFMam.......
1999/dump2001        mars      jfMAm.......
1999/dump2002        april     j.mAMj......
1999/dump2003        maj       ..maMJj.....
1999/dump2004        juni      ...amJJa....
1999/dump2005        juli      ....mJJas...
1999/dump2006        augusti   ....mjjAS...
1999/dump2007        september .......aSOnd
1999/dump2008        oktober   .......asONd
1999/dump2009        november  j.......soND
1999/dump2010        december  Jf.......onD
/key                           123456789012

(Those were grepped ignoring case, but case was observed when deciding on the most frequent ones. Am I being pedantic yet?)
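A key line like those in the table above could be computed along the following lines. This is a sketch under assumptions: the exact grep patterns are not recorded here, so whole-word, case-insensitive matching of the twelve Swedish month names is a guess, and the handling of case and of ties may differ from what was actually done.

```python
import re
from collections import Counter

MONTHS = ["januari", "februari", "mars", "april", "maj", "juni",
          "juli", "augusti", "september", "oktober", "november", "december"]
PATTERN = re.compile(r"\b(" + "|".join(MONTHS) + r")\b", re.IGNORECASE)


def month_key(text):
    """Summarise which months are named most often in the text.

    The two most frequent months appear as upper-case initials, the next
    three as lower-case initials, and all others as dots.
    """
    counts = Counter(m.group(1).lower() for m in PATTERN.finditer(text))
    ranked = [name for name, _ in counts.most_common(5)]
    key = []
    for name in MONTHS:
        if name in ranked[:2]:
            key.append(name[0].upper())
        elif name in ranked[2:]:
            key.append(name[0])
        else:
            key.append(".")
    return "".join(key)
```

The text given to `month_key` would be the concatenated content of all documents in one directory.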

Based on the above data and my guessed mapping, I boldly rename what should not have needed any renaming in the first place. I also put in the constant 01 "day number" as in 1998, to keep to the reasonable pattern yyyymmdd.

The numbers of documents and the largest counter values before and after the renaming match, so I consider 1999 of hufvudstadsbladet renamed. (It would often be nice to have the counters padded to the same length with zeroes on the left so that the lexicographic order would be the same as the numeric order on that field, but I do not do this now. Otherwise I would also have to rename similarly named files elsewhere in these collections. It really is minor.)
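The renaming of the 1999 documents, and the effect that padding would have on the sort order, can be sketched as follows. The function names are made up, and the optional padding is only there to illustrate the remark above; it was not applied to the corpus.

```python
import re

DUMP = re.compile(r"^dump(\d{4})\.(\d+)\.xml$")


def month_of(dump_number):
    """Map dump1999..dump2010 to the guessed months 1..12."""
    month = dump_number - 1998
    if not 1 <= month <= 12:
        raise ValueError(f"no month guessed for dump{dump_number}")
    return month


def new_name(old, pad=0):
    """Turn dumpNNNN.COUNTER.xml into 1999MM01.COUNTER.xml.

    With pad > 0 the counter is padded with zeroes on the left.
    """
    m = DUMP.match(old)
    if m is None:
        raise ValueError(f"unexpected name: {old}")
    month = month_of(int(m.group(1)))
    counter = m.group(2).zfill(pad)
    return f"1999{month:02d}01.{counter}.xml"
```

Without padding, 19990101.10.xml sorts before 19990101.9.xml in lexicographic order; with `pad=4` the names become 19990101.0010.xml and 19990101.0009.xml and sort numerically.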

Also removed useless fileStructure.lst files in 1998 and 1999 of hufvudstadsbladet. The former even listed different names than in the actual directory. Also removed indentall.log, but for the record, it claims to have found 43598 files, while find finds 43591 files in *.xml, and I know not what gives.

Then demari. It has the directories 1995, 1997, 1998, 1999 and 2000. So far it looks intact, but the directory 1997 has the following subdirectories:

  01/ 02/ 03/ 04/ 05/ 06/ 07/  10/ 11/  1999/ 99/
So what might 1997/1999 and 1997/99 be? (One could also think that 1997/08, 1997/09 and 1997/12 are missing, if one wants to think that these strings are meaningful in the obvious way that a person may readily imagine.)

In addition (still demari), the directory 1999 (that is, not 1997/1999 but the 1999 that sits beside the directory 1997) has a subdirectory 0000 (besides all twelve month-like ones), and in it there is only a subdirectory 00, which contains *.xml. Does that make sense? And the directory 2000 has a similar 0000/00 and, on top of that, a 1999. (Incidentally, the directory 1995 has no month-like subdirectories; the *.xml are directly in it.)

The directory 1997/1999/03 contains three *.xml files. Their metadata give the publication dates 1999-03-??. I guess that they belong in the directory 1999/03. I move them there. I remove the directory 1997/1999/03. The directory 1997/1999/07 contains one *.xml. Its metadata give the publication date 1999-07-??, so I guess that it belongs in the directory 1999/07. I move it there. I remove the directory 1997/1999/07. The directory 1997/1999/10 contains 13 *.xml. Their metadata give the publication dates 1999-10-??, so I guess that they belong in the directory 1999/10. I move them there. I remove the directory 1997/1999/10. I remove the directory 1997/1999.

The directory 1997/99 contains three *.xml. Their metadata give the publication date 0000-00-00. There does not seem to be sufficient ground for guessing anything at all, so I leave them there. Oops. I notice that these files also mention an "original" file name. According to those, it does look like these documents belong to the year 1997. On the other hand, the other 1997/?? directories contain documents that were originally in the directory loka1999 or marras1999 (October or November 1999) but were published in 1997 according to their metadata. This kind of thing cannot be corrected except on the basis of the content and by referring to sources outside the corpus. In other words: let them be where they are, but the directory names and the metadata cannot be trusted now.

The files in the directory 2000/0000/00 have the publication date 0000-00-00 and original file names in the directories tammi2000, helmi2000 and maalis2000 (January, February and March 2000). Since there are directories 2000/01, 2000/02 and 2000/03, I move these files into them. There they went. I remove the directory 2000/0000/00. I remove the directory 2000/0000.

The directory 2000/1999/01 is interesting, so to speak. There are only four documents in it, but they are dated 1999-01-?? and their "original" directory is tammi2000. Likewise, the only file in the directory 2000/1999/12 is from the directory tammi2000 but dated 1999-12-??. The right location cannot really be guessed, but these cannot be left here either. I establish a year 0000 and act as if the month were known in these but the year were not. So the directory 2000/1999 becomes the directory 0000, and it has the subdirectories 01 and 12. (Below I decide otherwise after all.)

I still have to return to the directory 1999, which has a subdirectory 0000/00. There I find a number of documents that are dated 0000-00-00 but whose original directory is usually of the form MONTH1999, except that some have helmi-maalis1999, though some also have helmi1999, and some have maalis-huhti without a year. I decide to believe in the year 1999, but I raise the "month" 0000/00 to be a month of the year 1999, 1999/00, alongside the others. It does not seem sensible to move some of the documents in 1999/00 to where they perhaps belong when not all of them could be moved anyway. I remove the now empty directory 1999/0000.

It turns out that the document in the directory 0000/12/ has a copy of the same name (strange, I thought there would be none, because of my database indexes) in the directory 1999/12/, and about the only difference is that the original name of the latter matches the name of the directory. I remove the document from the directory 0000/12. I remove the now empty directory 0000/12. The directory 0000/01 has only four documents, originally supposedly from the directory tammi2000 but dated 1999-01-??. Since there are so few of them and the year 0000 is so odd, I move them to the directory 1999/01 (no files of the same name there or elsewhere in demari) and remove the directories 0000/01 and 0000.

As a sanity check, at this point there are 23279 *.xml in the new version of demari and 23280 in the old. This exactly accounts for the one duplicate document that I removed.

Time to remove the remaining fluff. I remove one fileStructure.lst in hameensanomat as completely uninteresting. I leave two such in helsinginsanomat because they are more interesting, though apparently out of date. I may remove them later. I remove the one in kaleva. I remove the one in kangasalansanomat. I almost do not remove those in karjalainen because they again do not correspond to the current directory hierarchy, but I reconsider and remove them anyway, together with those in helsinginsanomat (I said I might remove them later, and now I do). They will remain in the old versions of the corpus.

I also remove a file named tmp in a turunsanomat directory. I thought I had removed it already. It was an unannotated version of a document that remains there in that same directory.

Remaining regular files not in *.xml in ftc/*/2010 are now four readme files in helsinginsanomat, three of which list the department codes, and one states that the unannotated version of 1995 has been archived. These will stay for now.

In fstc, in FNB, an indentall.log reports having found 47830 files, while find finds 47826 files in *.xml. Not knowing what to do with the information, and the slight discrepancy being common throughout these collections, I remove the file.

In jakobstadstidning, an indentall.log agrees with find that there are 10488 files to be found (in *.xml, for find, and who knows what the indentall.log is thinking). I remove the file.

A file notTagged2 in jakobstadsTidning was empty. I removed it. It can always be recreated with touch if someone finds a use for it (unless its time stamp is important, but I suppose anything important would be documented, er, maybe not, but I removed this anyway).

Another file notTagged in that directory contains 636 file names (in *.sgml) while the directory itself contains 1164 files in *.xml, and the couple of those that I checked, corresponding to the names in notTagged, are tagged. I remove the file. (The file appended :0 to each file name, too, but no indication as to what it might mean.)

The remaining fluff in fstc is a couple of files in hufvudstadsbladet that I myself made yesterday, to be removed.

At this point, I made a copy (two compressed tar balls) of all of ftc/*/2010 and fstc/*/2010, made a database of validity information "before", corrected the doctype declaration of every 2010//*.xml file to point to where the Kielipankki DTD actually is, and made a database "after". However, I found that my database creation program does notice duplicate file names, and I do have them in five corpora, and I think I do want to fix that. The duplicated names are listed in seendups now, with counts inside the corpus, and the five corpora are demari, kangasalansanomat, karjalainen, keskisuomalainen and turunsanomat. (This is also an opportunity to remove an unwanted comma in the database script element table declaration. Actually, let me do that and open the "before" archives of ftc and fstc to recreate the "before" database. Er. No, fixed the comma but I'll solve the duplicate names first and redo the databases only then.)
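Finding the duplicated names does not need a database as such. The following sketch collects, for one corpus, every basename that occurs in more than one directory. It is not the program that produced seendups; the function name and the returned structure are assumptions for illustration.

```python
import os
from collections import defaultdict


def duplicate_basenames(root, suffix=".xml"):
    """Map each basename that occurs more than once under root
    to the sorted list of directories (relative to root) where it occurs."""
    where = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                where[name].append(os.path.relpath(dirpath, root))
    return {name: sorted(dirs) for name, dirs in where.items() if len(dirs) > 1}
```

The number of keys in the result is the number of duplicated names, and the length of each list is the count of that name inside the corpus.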

Let us take demari first. There are 43 duplicate names, each appearing twice in the corpus. I take the first, one in the middle, and the last, to see if their content matches. The first name, 235ul2.xml, occurs in 1998/11 and 1999/12. The two documents turn out to be different. Unfortunately, the first one also turns out to be dated 1997-07-24 which does not match its directory name or its original directory marras1998. [Censored]

The middle demari duplicate name is, to pick one, 140syrj.xml. It occurs in the directories 1998/07 and 1999/07. The documents are different. Their dates match the current directories, and they almost match the original directories: the one in 1999/07 is said to have been in kesa1999, which looks like the month should be 06.

The last demari duplicate name, 043muut.xml, occurs in the directories 1997/03 and 2000/03. The documents are different and this time their names and dates match, though the original directory of the 1997 one does not refer to a month at all.

Decision: the duplicate document names in demari belong to different documents and can be sensibly renamed. All demari basenames in d*.xml occur in 1995 and are in demaDIGITS.xml. I rename the documents in the other years by adding the prefix demYYMM where the YY and MM refer to their current directory names and never mind whether the various candidate dates of any individual document match each other. (There is never any risk of introducing a name already in use in 1995 because those begin with dema, with an a at the end, and the new names do not.)
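The renaming rule just decided can be sketched as a function from old relative paths to new ones. The names `new_demari_path` and `rename_plan` are made up, and the sketch only computes a plan; it assumes paths of the form YYYY/MM/NAME.xml relative to the corpus root, with the 1995 documents lying directly in 1995/ and left alone.

```python
import re

MONTHLY = re.compile(r"(\d{2})(\d{2})/(\d{2})/([^/]+\.xml)")


def new_demari_path(relpath):
    """Map YYYY/MM/NAME.xml to YYYY/MM/demYYMMNAME.xml."""
    m = MONTHLY.fullmatch(relpath)
    if m is None:
        raise ValueError(f"not a YYYY/MM/NAME.xml path: {relpath}")
    century, yy, mm, name = m.groups()
    return f"{century}{yy}/{mm}/dem{yy}{mm}{name}"


def rename_plan(relpaths):
    """Return {old: new} for all documents outside 1995.

    Raises ValueError if the new basenames would still not be unique.
    """
    plan = {old: new_demari_path(old)
            for old in relpaths if not old.startswith("1995/")}
    basenames = [new.rsplit("/", 1)[1] for new in plan.values()]
    if len(set(basenames)) != len(basenames):
        raise ValueError("renaming would still leave duplicate basenames")
    return plan
```

Since every new basename begins with dem followed by a digit, none of them can coincide with a demaDIGITS.xml name from 1995.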

There were 23279 demari documents before the renaming, and there are 23279 after. That be good.

Next there is kangasalansanomat. There are 26 duplicate names. Among them is still another name with a funny-encoded ä. [Censored] There appear to be many of those in this corpus. Oh well. Must bite.

In EijaKoivu, 75 files. Rename el?inl??.xml to elainlaa.xml. Learn that ?ij?kuva.xml and ?ij?.xml are about a person named Äijö, not about an äijä, so rename them to aijokuva.xml and aijo.xml. Rename kev?tp.xml to kevatp.xml. Rename k?nnykk?.xml to kannykka.xml. Rename syntt?ri.xml to synttari.xml. Next, t?rm?kuv.xml is a six-word document (metadata says twelve, but there are only six, plus two punctuation marks) that gives no clue about the identities of the ?. Let it become txrmxkuv.xml. And rename yritt?j?.xml to yrittaja.xml. Sigh. Still 75 files here.

In JouniValkeeniemi, 602 files, but only a couple have ?'s. Rename et?lukio.xml as etalukio.xml, but j?tski.xml and jatski.xml both exist and are different documents, so I rename j?tski.xml to jaatelo.xml instead. Rename keng?t.xml to kengat.xml and kev?tpy?.xml to kevatpyo.xml (it's about bikers in the spring) and wish once more that these had not been named so meaningfully. Rename l?hpiiri.xml as lahpiiri.xml. Rename m?yr?vuo.xml to mayravuo.xml. And nelj?h.xml to neljah.xml though 4h.xml is also tempting oops getting carried away. And ??nestys.xml to aanestys.xml, py?p?iv?.xml to pyopaiva.xml and py?r?4.xml to pyora4.xml. Then s??nn?t.xml to saannot.xml and str?mmer.xml to strommer.xml. Finally, ty?t{04,11,28}.xml to tyotDD.xml, cannot bother to check their content to be (more) sure. I think that was all of them, and they were more than a couple after all. There are still 602 files in *.xml in this directory.

None of these in MattiKauhanen, but there is one kangas~1.xml that I rename to kangas1.xml. (It is different from kangsa.xml.) None of these in PekkaKaarna. Several in SariEerolainen, including the duplicate of kev?tp.xml that again becomes kevatp.xml. Rename h??juhla.xml to haajuhla.xml. And l?hem.xml says, muistakaamme lähemmäisiämme (roughly "let us remember our neighbours"; yes it does say that), so I rename it to lahem.xml and be done with it. Oh, there is still heikkil?.xml which becomes heikkila.xml.

Ooh. Still I find yleis?.xml in JouniValkeeniemi and rename it to yleiso.xml. Now I think I'm done with ä's and ö's in kangasalansanomat file names. About those duplicates, then. First I see that the two named talous.xml are different. They totally are. And the two named gallup.xml. They ever so totally are. What about the two named kevatp.xml? Different. Ok, believe that these all differ, and just canonicalize the names by adding the author initials as a prefix to each. There is jvoksa.xml that becomes jvjvoksa.xml and there is pk3012.xml that becomes pkpk3012.xml but nothing becomes overridden. Before, 1354 files in "*.xml" in the corpus. After, still 1354 files.
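The prefixing by author initials can be sketched as below. That the initials are simply the capital letters of the directory name, in lower case, is an assumption that happens to fit the two examples above (jv and pk); the function names are made up.

```python
import re


def initials(directory):
    """Derive lower-case initials from a CamelCase directory name."""
    return "".join(re.findall(r"[A-Z]", directory)).lower()


def with_initials(directory, name):
    """Prefix a document name with the initials of its author directory."""
    return initials(directory) + name
```

Two documents of the same name in different author directories then get different names as long as the initials of the authors differ, which should be checked against the actual directory names before renaming anything.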

Keskisuomalainen and turunsanomat remain to be dealt with for duplicate names.

In keskisuomalainen, 2046 document names occur more than once, and almost all of these 2046 occur four times. The names appear to be in 1994/*/keskiDIGITS.xml where the digits are a counter within a 1994/0M where M is one of 1, 2, 3, 4. (Names in 1999 are different but they are also unique within the corpus.) I will prefix the names in each 1994/0M with M and be done with it. First there are 8367 *.xml in 1994 distributed over the four months as 2039, 2046, 2250, and 2032. After the renaming, the numbers are the same. That should have taken care of keskisuomalainen.

In turunsanomat, 16597 document names occur two to four times. The pattern is YYYY/L/L/NAME.xml, where each L is a lowercase ASCII letter and each name attempts to describe the content of the document. For some obscure reason, find becomes insensitive to case when I try to match a range of characters, but setting LC_ALL to C appears to fix that, and I am able to confirm that no document name yet starts with a lowercase ASCII character, so it is safe to prefix the document names with the LL, the two letters naming each directory in the corpus. Before doing that, there are 88617 files in *.xml in the corpus. After doing that, there are still 88617 files in *.xml. Whew. And a couple of samples of names, at head and tail of 199?/b/h, seem ok.
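The check and the prefixing can also be sketched in Python, where comparing characters by code point does not depend on the locale. The function names and the example document name in the usage note are made up for illustration.

```python
def is_lower_ascii_letter(ch):
    """True for exactly one character in the range a..z."""
    return len(ch) == 1 and "a" <= ch <= "z"


def prefixed(relpath):
    """Map YYYY/L/L/NAME.xml to YYYY/L/L/LLNAME.xml.

    Refuses names that already start with a lowercase ASCII letter,
    because those could be confused with already prefixed names.
    """
    year, first, second, name = relpath.split("/")
    if not (is_lower_ascii_letter(first) and is_lower_ascii_letter(second)):
        raise ValueError(f"unexpected directory names in {relpath}")
    if is_lower_ascii_letter(name[:1]):
        raise ValueError(f"name already starts in lower case: {relpath}")
    return f"{year}/{first}/{second}/{first}{second}{name}"
```

With a hypothetical document 1998/b/h/EXAMPLE.xml, the new path would be 1998/b/h/bhEXAMPLE.xml.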

Now there should be no duplicate names not no more. Time to run a new database. Well, no, time to go home, and I will launch the run from there later.

Bother. There are still 14669 duplicated names in turunsanomat, occurring both in 1998 and 1999. Stupid me. I will prefix them all with 98 or 99 still. So, first there are 88617 *.xml in turunsanomat/2010, of which 43861 in 1998 and 44756 in 1999. After, 88617 in turunsanomat/2010, of which 43861 in 1998 and 44756 in 1999, as it should be.

No duplicates found. Re-creating the fault-finding database, or rather, creating another. A preliminary observation suggests that the addition of the syn attribute to the w element in the DTD (a private copy, to replace the public copy) appears to have corrected all the remaining validity errors. However, the documents are still not well-formed without loading the DTD where two entities of dubious necessity are defined: corpuslanguages and corpustaxonomy.

But this belongs to the further steps already.

Stage 3 of the cleanup

Make every document validated. That is, make them all valid, and demonstrate their validity by using a validating parser on them.

Most should be all right already. Correct what remains. Take notes.

The validating parser to use is xmllint from the libxml project. Also work out how to do the validation from within Python, using lxml, as an example. Document this anyway.
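As a starting point for the Python side, validation with lxml might look like the sketch below, which does roughly what xmllint does when given the options --noout and --valid. It validates against whatever DTD the doctype declaration of the document names, so the declaration must already point to the right place. The function name is made up, and the import is placed inside the function only so that the code can be loaded on a machine where lxml is not installed.

```python
def validation_errors(path):
    """Parse one document with DTD validation turned on.

    Returns a list of error messages; an empty list means that the
    document is well-formed and valid against the DTD it declares.
    """
    from lxml import etree  # third-party package, imported lazily

    parser = etree.XMLParser(dtd_validation=True, no_network=True)
    try:
        etree.parse(path, parser)
    except etree.XMLSyntaxError as exc:
        messages = [entry.message for entry in parser.error_log]
        return messages or [str(exc)]
    return []
```

A whole corpus would be checked by calling this on every *.xml found with os.walk and recording the non-empty results.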

Stages 4 and onwards of the 2010 cleanup

Finally or so, undo the entity-induced dependence on the DTD. We want valid, well-formed XML that can be manipulated as well-formed XML of known structure without loading the DTD. This may require changing some IDREF things in the documents, since these appear to be the reason for the introduction of those entities in the first place.

When this is fixed, perhaps even drop the DOCTYPE declaration altogether. Or should we still keep it? Can we use and validate the documents then even if the DTD moves to a different location? Find out and decide.

Semantic corrections are quite likely outside my resources now. Formally valid markup for every single document is the goal.

The execution and finish

Practice with Jakobstads Tidning, the whole corpus, and also with some unannotated corpus.

When all else is good, add some statement of my intrusion in the headers. (Roope didn't.)

Topic revision: r18 - 2011-02-07 - JussiPiitulainen