Helpdesk Item: How do I find out the word count of the corpora in use in the WWW-Lemmie 2 tool?
Description of the Problem
The WWW-Lemmie 2 tool lets you select the Corpora in Use. The selection is made in the Settings function of the WWW interface. If I search only a subset of the corpora in the Lemmie database, how do I find out the total size of the set of corpora in use?
The Answer(s)
A Fully Automatic Method
WWW-Lemmie 2 has no built-in function for calculating the total word count of the corpora in use, so no fully automatic method is available.
A Manual Method
In practice, one way to compute the word count is to add up the word counts of the individual corpora listed in either of these places:
- https://hotpage.csc.fi/su-cgi-bin/appl/ling/lemmie2/myCorpora.cgi (myCorpora inside Lemmie)
- http://www.csc.fi/kielipankki/aineistot/ftc.phtml (the public list of the contents of the text bank of Finnish)
Computing the sums manually may take some time, but writing down the names and sizes of the individual corpus resources included in the research material is also very useful for research documentation.
A Semi-Manual Method
What we did for you:
- We cut and pasted the list at https://hotpage.csc.fi/su-cgi-bin/appl/ling/lemmie2/myCorpora.cgi into an Emacs buffer. Then, using Emacs' replace-string, we changed the tab characters to commas and ran a few more replace commands to make the format more compact. Finally, we replaced every newline character (^J, entered in Emacs with C-q C-j) with a carriage return followed by a newline (^M^J, entered with C-q C-m C-q C-j), which turned the text file into a valid Notepad text file.
- Then we transferred the file over SSH FTP to a Windows desktop machine, opened it with Notepad (see the attachment Corpora and words in WWW Lemmie 2.txt), and cut and pasted its contents into MS Word. There we converted the text into a table, added a header, and saved the document (see the attachment Corpora and words in WWW Lemmie 2.doc).
- Finally, we cut and pasted the table into Excel, where it was easy to count the total number of words in these corpora (see the attachment Corpora and words in WWW Lemmie 2.xls).
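The Emacs steps above (tabs to commas, then ^J to ^M^J) could also be scripted. The following is only a minimal Python sketch, assuming the pasted list has one corpus per line with tab-separated fields. The function name to_notepad_csv and the sample corpus names are hypothetical, and the real layout of the myCorpora list may differ.

```python
def to_notepad_csv(pasted: str) -> str:
    """Turn a tab-separated listing into comma-separated text with CRLF line ends."""
    # Normalise existing line endings to LF first so that CRLF is never doubled.
    lines = pasted.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    rows = []
    for line in lines:
        if not line.strip():
            continue  # drop empty lines
        # Assumption: fields are separated by tab characters.
        fields = [field.strip() for field in line.split("\t")]
        rows.append(",".join(fields))
    # Notepad expects carriage return + line feed (^M^J) at the end of each line.
    return "".join(row + "\r\n" for row in rows)


# Hypothetical sample data, not taken from the real corpus list.
sample = "corpus_a\t1000\ncorpus_b\t2500\n"
converted = to_notepad_csv(sample)
```

When saving the result to a file, opening it with open(path, "w", newline="") keeps Python from translating the line endings a second time.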
What you need to do:
- download one of the attached files,
- remove the lines for corpora that are not in use, and
- do the calculations either by hand or with a spreadsheet program such as Excel.
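As an alternative to a spreadsheet, the last two steps can be sketched in Python. This sketch assumes each line of the text file has the form "name,word count", which may not match the attachment exactly. The function name total_words and the corpus names below are hypothetical placeholders.

```python
def total_words(csv_text: str, corpora_in_use: set) -> int:
    """Sum the word counts of the selected corpora in 'name,count' lines."""
    total = 0
    for line in csv_text.splitlines():
        # Split on the last comma so that commas inside a name do no harm.
        parts = line.rsplit(",", 1)
        if len(parts) != 2:
            continue  # not a 'name,count' line
        name, count = parts[0].strip(), parts[1].strip()
        # Skip header lines, malformed counts, and corpora that are not in use.
        if name in corpora_in_use and count.isdigit():
            total += int(count)
    return total


# Hypothetical data: three corpora listed, two of them in use.
listing = "Corpus,Words\r\ncorpus_a,1000\r\ncorpus_b,2500\r\ncorpus_c,400\r\n"
in_use = {"corpus_a", "corpus_c"}
result = total_words(listing, in_use)
```

Keeping the set of selected names next to the computed total also gives a record of which corpora the figure covers, which is useful for research documentation.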
The Attachments
This topic contains attachment files:
-- AnssiYliJyra - 21 Aug 2006