
Helpdesk Item: How to Generate Word Lists

Description of the Problem

I want to generate word lists from XML files in the Language Bank. How can I parse the XML?

The Answers

Several approaches are possible. So far, only the following one has been described here:
  • Using tei2snt

Using the utility tei2snt

Key tools:
  • find - lists the files in the corpus
  • xargs cat - concatenates and prints the content of the files
  • tei2snt - strips the XML encoding from the files
    • It outputs the text between s-tags and gives no output for texts that are not in Lemmie. The source of the program is available at /mnt/corpus/appl/ling/contrib/src/tei2snt, and knowledgeable users are invited to contribute improvements for the benefit of other users.
  • tr ' ' '\012' - puts each word on a separate line
  • egrep '[a-z]' - extracts words with at least one lowercase letter (this may be improved)
  • linefreq - counts the frequencies of line types in the input; the output is in an arbitrary order
  • sort -n -r - sorts the frequency list into reverse numeric order, so the most frequent words come first
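
The tokenising and counting steps can be tried out without access to the corpus. The sketch below is written for a POSIX shell and uses only standard tools. Since tei2snt and linefreq are local utilities, `sort | uniq -c` is used here as an assumed stand-in for linefreq (its real output format is not documented on this page; a "count word" layout is assumed), and a two-line sample text replaces the output of tei2snt. The function name count_words is hypothetical.

```shell
# Hypothetical helper: split on spaces, keep tokens containing a lowercase
# ASCII letter, count identical lines, and sort by descending frequency.
# LC_ALL=C makes [a-z] match ASCII lowercase letters only.
count_words() {
  tr ' ' '\012' | LC_ALL=C grep -E '[a-z]' | LC_ALL=C sort | uniq -c | sort -n -r
}

# Sample text standing in for the output of tei2snt.
printf 'kissa istuu ja koira istuu\nKOIRA ja kissa ja hiiri\n' | count_words
```

The token KOIRA is dropped because it contains no lowercase letter, and the most frequent word (ja, three occurrences) appears on the first line of the output.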

A script that extracts words from the Text Bank of Finnish and counts their type frequencies:

FIND | egrep '\.xml$' | xargs cat | tei2snt | tr ' ' '\012' | egrep '[a-z]' | linefreq | sort -n -r
where FIND is one of the following:
find /l/corpus/kielipankki/teksti/fi/FI/*/ -type f -group 
find /l/corpus/kielipankki/teksti/fi/FI/*/ -type f -group sktp-a 
find /l/corpus/kielipankki/teksti/fi/FI/*/ -type f -group sktp-b 
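
As a possible alternative for the file selection step, find can filter on the file name itself, and `-print0` together with `xargs -0` (extensions available in GNU and BSD tools) keeps file names with spaces intact. The following is only a sketch for a POSIX shell: the temporary directory and its files are made up for illustration and stand in for the corpus tree.

```shell
# Build a throwaway directory tree that stands in for the corpus.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
printf '<s>yksi</s>\n' > "$demo/a.xml"
printf '<s>kaksi</s>\n' > "$demo/sub/b c.xml"   # file name contains a space
printf 'not xml\n' > "$demo/notes.txt"

# Select only the *.xml files and concatenate their content.
# The final sort just makes the order of the demo output predictable.
xml_text=$(find "$demo" -type f -name '*.xml' -print0 | xargs -0 cat | LC_ALL=C sort)
printf '%s\n' "$xml_text"

rm -rf "$demo"
```

Only the content of the two XML files is printed; notes.txt is skipped.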

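The three resulting word lists (academic only, business only, and combined):
/fs/metawrk/ling/sanat.aca
/fs/metawrk/ling/sanat.biz
/fs/metawrk/ling/sanat.all
Delete the byte with octal code 240 from each list (the `> !` redirection is tcsh syntax that overwrites an existing file):
tr -d '\240' < sanat.aca > ! sanat.aca.x
tr -d '\240' < sanat.biz > ! sanat.biz.x
tr -d '\240' < sanat.all > ! sanat.all.x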
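
In Latin-1 (ISO 8859-1), octal 240 is the non-breaking space. Assuming the lists are in that encoding, the effect of the tr step can be checked on a small sample in a POSIX shell. The sample string is made up, and LC_ALL=C is set so that tr treats its input as single bytes.

```shell
# A sample token with an embedded non-breaking space (byte \240).
cleaned=$(printf 'ta\240lo\n' | LC_ALL=C tr -d '\240')
printf '%s\n' "$cleaned"
```

The byte is removed and the two halves of the token are joined.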

The 33000 most frequent words of each list, followed by the sorted word column (field 2) of each
head -33000 /fs/metawrk/ling/sanat.all.x > ! /fs/metawrk/ling/sanat.all.33000
head -33000 /fs/metawrk/ling/sanat.aca.x > ! /fs/metawrk/ling/sanat.aca.33000
head -33000 /fs/metawrk/ling/sanat.biz.x > ! /fs/metawrk/ling/sanat.biz.33000
gawk '{ print $2; }' < /fs/metawrk/ling/sanat.all.33000 | sort > ! /fs/metawrk/ling/sanat.all.33000.w
gawk '{ print $2; }' < /fs/metawrk/ling/sanat.aca.33000 | sort > ! /fs/metawrk/ling/sanat.aca.33000.w
gawk '{ print $2; }' < /fs/metawrk/ling/sanat.biz.33000 | sort > ! /fs/metawrk/ling/sanat.biz.33000.w
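
The two steps above (cut the list at the N most frequent entries, then keep only the word column in sorted order) can be sketched on a small made-up frequency list. The sketch is for a POSIX shell, uses plain awk instead of gawk, and assumes a "count word" line layout, which is what printing field 2 implies.

```shell
# A made-up frequency list, already sorted by descending count.
freq=$(printf '3 ja\n2 kissa\n2 istuu\n1 koira\n')

# Keep the 3 most frequent entries, extract the word column, sort the words.
top_words=$(printf '%s\n' "$freq" | head -n 3 | awk '{ print $2 }' | LC_ALL=C sort)
printf '%s\n' "$top_words"
```

The sorted word-only list is what the comparison steps need as input, since `sort -m` expects its input files to be sorted already.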

Comparing sanat.biz.33000 and sanat.aca.33000
gawk '{ print $2; }' < /fs/metawrk/ling/sanat.biz | sort > ! /fs/metawrk/ling/sanat.biz.w
gawk '{ print $2; }' < sanat.biz | sort > ! sanat.biz.w
uniq -d sanat.biz.w
sort -m sanat.biz.w sanat.all.w | sort -c
sort -m sanat.biz.w sanat.all.30000.w | sort -c
sort -m sanat.biz.w sanat.all.33000.w | sort -c
sort -m sanat.biz.w sanat.all.33000.w | uniq -d > bizall.33000
sort -m bizall.33000 sanat.all.33000.w | uniq -u | wc
sort -m bizall.33000 sanat.all.33000.w | uniq -u
wc bizall.33000
wc sanat.biz
gawk '{ print $2; }' < /fs/metawrk/ling/sanat.aca | sort > ! /fs/metawrk/ling/sanat.aca.w
gawk '{ print $2; }' < sanat.aca | sort > ! sanat.aca.w
sort -m sanat.aca.w sanat.biz.33000.w | uniq -u
sort -m sanat.aca.w sanat.biz.33000.w | uniq -d
sort -m sanat.aca.w sanat.biz.33000.w | uniq -d > sanat.biz.33000.also.aca
wc sanat.biz.33000.also.aca
sort -m sanat.biz.33000.also.aca sanat.biz.33000.w | uniq -u
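
The comparison idiom used above relies on each word-only list being sorted and free of duplicates: after merging two such lists, `uniq -d` prints the words that occur in both, and `uniq -u` prints the words that occur in only one of them. A minimal sketch for a POSIX shell, with two made-up lists in a temporary directory:

```shell
# Two small, sorted, duplicate-free word lists (made up for illustration).
tmp=$(mktemp -d)
printf 'ja\nkissa\nosake\n' > "$tmp/biz.w"
printf 'ja\nkissa\nteoria\n' > "$tmp/aca.w"

# Words in both lists: merged duplicates.
common=$(LC_ALL=C sort -m "$tmp/biz.w" "$tmp/aca.w" | uniq -d)
# Words in exactly one list: lines that stay unique after merging.
only_one=$(LC_ALL=C sort -m "$tmp/biz.w" "$tmp/aca.w" | uniq -u)

printf 'common:\n%s\n' "$common"
printf 'in one list only:\n%s\n' "$only_one"
rm -rf "$tmp"
```

If a list contained the same word twice, `uniq -d` would report it even when the other list lacks it, which is why checking a list with `uniq -d` beforehand is worthwhile.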


-- AnssiYliJyra - 10 Jun 2006

HelpdeskForm
HelpdeskProblemName Making word lists
HelpdeskProblemAbstract I want to make word lists from XML encoded texts in the Language Bank of Finland.
HelpdeskUrgency OptionalImprovement
HelpdeskNumberOfUsers 75
HelpdeskDateIssued 2006-06-10
Topic revision: r8 - 2008-11-10 - HennaRiikkaLaitinen
 