
Helpdesk Item: How do I know the word count of the corpora in use in the WWW-Lemmie 2 tool?

Description of the Problem

The WWW-Lemmie 2 tool lets you select the Corpora in Use. The selection is made in the Settings function of the WWW interface. If I search only a subset of the corpora in the Lemmie database, how do I know the total size of the set of corpora in use?

The Answer(s)

A Fully Automatic Method

WWW-Lemmie 2 offers no quick way to calculate the total word count of the corpora in use, so no fully automatic method is available.

A Manual Method

In practice, one way to compute the word count is to add up the word counts of the individual corpora listed at https://hotpage.csc.fi/su-cgi-bin/appl/ling/lemmie2/myCorpora.cgi (myCorpora inside Lemmie) or at http://www.csc.fi/kielipankki/aineistot/ftc.phtml (the public list of the contents of the Finnish text bank).

Computing the sums manually may take time in some cases, but writing down the names and sizes of the individual corpus resources included in the research material is also very useful for research documentation.

A Semi-Manual Method

What we did for you:

  1. We copied the list at https://hotpage.csc.fi/su-cgi-bin/appl/ling/lemmie2/myCorpora.cgi and pasted it into an Emacs buffer. Using the Emacs replace-string command, we changed the tab characters to commas and ran a few further replace commands to make the format more compact. Finally, we replaced every newline character (entered in Emacs as C-q C-j, shown as ^J) with a carriage return followed by a newline (C-q C-m C-q C-j, shown as ^M^J), so that the text file has the line endings that Notepad expects.
  2. We then transferred the file over SFTP (SSH File Transfer Protocol) to a Windows desktop machine, opened it in Notepad (see the attachment CorporaAndWordsInWWWLemmie2.txt), and copied its contents into MS Word. There we converted the text into a table, added a header row, and saved the document (see the attachment CorporaAndWordsInWWWLemmie2.doc).
  3. Finally, we copied the table into Excel, where it was easy to sum the total number of words in these corpora (see the attachment CorporaAndWordsInWWWLemmie2.xls).
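
If you prefer a script to Emacs, the text conversion in step 1 can be sketched in Python. This is only an illustration of the two replacements described above (tabs to commas, newline to carriage return plus newline). The function name tabs_to_csv_crlf and the corpus names in the sample are hypothetical, and the extra compacting replacements we made are not reproduced here.

```python
def tabs_to_csv_crlf(text: str) -> str:
    """Turn tab-separated lines into comma-separated lines ending in CRLF."""
    converted = []
    for line in text.splitlines():
        # Replace every tab character with a comma.
        converted.append(line.replace("\t", ","))
    # End every line with carriage return + newline (^M^J).
    return "".join(line + "\r\n" for line in converted)


# Hypothetical sample; the real list comes from the myCorpora page.
sample = "corpus_a\t1000\ncorpus_b\t2500\n"
print(repr(tabs_to_csv_crlf(sample)))
```

Note that a corpus name containing a comma would need quoting; the sketch assumes no such names occur.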

What you need to do:

  1. download one of the files,
  2. remove irrelevant lines and
  3. do the calculations either by hand or with a spreadsheet program such as Excel.
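
As an alternative to a spreadsheet, the three steps above could be done with a short Python script. The sketch below assumes that the text has one corpus per line in the form "name,word count" with the count written as plain digits; we have not verified that the attached file follows exactly this layout, so adjust the parsing if it differs. The function name total_words and the corpus names in the sample are hypothetical.

```python
import csv
import io


def total_words(csv_text: str, corpora_in_use: set) -> int:
    """Sum the word counts of the corpora named in corpora_in_use.

    Assumes each line looks like "name,count"; lines that do not
    match (for example a header row) are skipped.
    """
    total = 0
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        name, count = row[0].strip(), row[1].strip()
        # Keep only the relevant lines, i.e. the corpora in use.
        if name in corpora_in_use and count.isdecimal():
            total += int(count)
    return total


# Hypothetical sample data, not the real corpus list.
sample = "Corpus,Words\r\ncorpus_a,1000\r\ncorpus_b,2500\r\ncorpus_c,40\r\n"
print(total_words(sample, {"corpus_a", "corpus_c"}))  # prints 1040
```

To use it on the downloaded file, read the file contents into a string and pass the names of the corpora you selected in Settings as the second argument.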

The Attachments

This topic contains three attachment files, listed under Topic attachments below.

-- AnssiYliJyra - 21 Aug 2006

HelpdeskForm
HelpdeskProblemName How Do I Know the Size of the Lemmie Corpora in Use?
HelpdeskProblemAbstract If I use Lemmie to search the Corpora in Use as defined in Settings, how do I know the total size of those corpora?
HelpdeskUrgency FullyResponded
HelpdeskNumberOfUsers 100
HelpdeskDateIssued 2006-09-07
Topic attachments
Attachment                        Size     Date                 Who
CorporaAndWordsInWWWLemmie2.doc   94.0 K   2006-09-07 - 08:16   AnssiYliJyra
CorporaAndWordsInWWWLemmie2.txt   3.7 K    2006-09-07 - 08:16   AnssiYliJyra
CorporaAndWordsInWWWLemmie2.xls   23.5 K   2006-09-07 - 08:17   AnssiYliJyra
Topic revision: r2 - 2006-09-07 - AnssiYliJyra
 