CORPUS PROCESSING

Up until now, we have been working with small text examples, or with the file “_en Portrait of a Lady.not”, which is distributed with NooJ. This file, like all “.not” files, is a text file stored in NooJ’s format, which means that it contains the text together with its linguistic information: usually, the text has been delimited into text units; each text unit is associated with a Text Annotation Structure; and a set of dictionaries and morphological and syntactic grammars has already been applied to the text.

We will see in Chapter 7 how to import and process “external” text files: “raw” texts (such as those edited with MS-Windows Notepad), documents created with word-processing programs such as MS-Word, Web pages pulled from the Web, etc. NooJ’s .not text files can also be exported as XML documents. In Chapter 8 we will discuss the construction and management of corpora with NooJ. A corpus is a set of text files that share the same characteristics (e.g. language, file format and structure) and that NooJ can process as a whole.

Chapter 7. Importing Text Files

NooJ uses its own file format to process texts. Basically, “.not” NooJ text files store the text as well as structural information (e.g. text units), various indices and linguistic annotations in the Text Annotation Structure.

To open a NooJ text file, use the command File > Open > Text.

To create a new text file, use the command File > New Text. NooJ then creates an empty “.not” file that is ready to be edited, saved and processed. Note that in that case, you will have to perform a Linguistic Analysis (TEXT > Linguistic Analysis) before being able to apply queries and grammars to the text.

Any “.not” file can be modified. In order to edit a .not file, click TEXT > Modify. Note that as soon as you modify a text file, NooJ erases its Text Annotation Structure. Therefore, you will need to perform a Linguistic Analysis again before being able to apply queries or grammars to it.

NooJ is usually used to work with “external” text files, i.e. with texts that were constructed with another application. We need to import them in order to create a “.not” file and thus to parse them with NooJ.

In order to import a text file, select File > Open > Text, and then, at the bottom of the window, select the option “Files of type: Import Text”.

7.1 The Text's Language

(See A) Although NooJ’s multilingual engine can process texts that contain multilingual content, it can perform a given linguistic analysis in only one language (each instance of its linguistic engine works in “monolingual mode”). That is to say, any instance of NooJ’s linguistic engine applies dictionaries that belong to one, and only one, language. When using NooJ’s Graphical User Interface, the option “Select Language” sets the current language, i.e. the set of linguistic data that will be applied to the text by the next “Linguistic Analysis” command.

7.2 The Text's file format

(See B) NooJ understands all variants of DOS, EBCDIC, ISCII, ISO, OEM, Windows, MAC and UNICODE character encodings, as well as specific file formats such as Arabic ASMO, Japanese EUC and JIS variants, Korean EUC and Johab, etc.

Figure 1 Import a text

On certain Windows 2000 or XP systems, support for some of these file formats might not already be installed on the PC. For instance, during the installation of the French version of Windows XP, support for Asian languages and file formats is not installed by default. If a language or a file format is not supported on your PC, install it by going to the Windows Control Panel, then to “Regional and Language Options”, and setting the list of supported languages and file formats.

The file format selected by default for importing texts is “ASCII or Byte-Marked Unicode (UTF8, UTF16B or UTF16L)”. That means that characters should be encoded in ASCII (the “US-only” version of ASCII, i.e. 128 characters without accents), or in one of three variants of Unicode, as indicated by a special code (the byte-order mark) inserted at the very beginning of the text file.
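The byte-mark detection described above can be sketched as follows. This is a minimal illustration of the principle, not NooJ’s actual code; the function name `detect_encoding` and its return labels are our own:

```python
import codecs

def detect_encoding(raw: bytes) -> str:
    # Inspect the first bytes of a file for a Unicode byte-order mark (BOM);
    # with no BOM, assume pure 7-bit ASCII. (A full detector would also
    # check the UTF-32 BOMs, omitted here for brevity.)
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"     # "UTF8" in the import dialog
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"     # "UTF16L" in the import dialog
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"     # "UTF16B" in the import dialog
    return "ascii"             # no BOM: assume pure ASCII
```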

If you are trying to import a text file that was created with a Windows application such as Notepad, you should select the file format “Other raw text formats”: NooJ will then select, by default, the file format used by your OS, e.g. “Western European Windows” on French PCs.

Find out what your text’s file format is before importing it into NooJ. If you notice that the imported text has lost all its accents, chances are that you selected the default “pure ASCII” format instead of one of its extended versions. Close the text file, and then re-import it with the correct file format.

NooJ understands a number of other file formats used to represent structured documents, such as RTF, HTML and all variants of MS-WORD (PC, Windows or Mac), as well as a number of “other document file formats” that can be opened by MS-WORD’s text import functionality, such as WordPerfect, Outlook, Schedule, etc.

In order to process documents represented in any of the MS-Word formats, as well as in the “other documents” file formats, NooJ actually connects to the MS-WORD application and launches its “text load” functionality. Therefore, importing these documents is only possible if MS-Word is installed on your computer.

7.3 Text Unit Delimiters

(See C) When NooJ parses a text, it processes the Text Units (TUs) one at a time. Consequently, NooJ cannot locate a pattern that would overlap two or more text units. It is important to tell NooJ what the text’s units are, because they constrain what exactly NooJ can find in the text.

If one works on Discourse Analysis, studying how sentences are organized together, or tries to resolve anaphora that may span several paragraphs, then Text Units should be as large as possible.

On the other hand, if one studies the vocabulary of a corpus, or even co-occurrences of word forms in a corpus, then Text Units should be as small as possible.

Enter the corresponding Text Unit Delimiter:

(1) No delimiter:

The whole text is seen as one large text unit. NooJ can find patterns that span over the whole text, e.g. can locate the co-occurrence of the first and the last word forms of the text.

This option is particularly useful when texts are small, and the information to be extracted can be located anywhere in the text. For instance, consider a semantic analysis of a technical news item (weather report, medical statement, financial statement) that aims at producing a semantic predicate such as:

Admission ( Date (3,10,2001),
            Location (Toronto),
            Patient (Sex (F), Age (50)) )

from a text in which each elementary piece of information (location, date, etc.) could be located in a different paragraph, anywhere from the very beginning of the text file to its very end.

(2) Text Units are lines/paragraphs:

This is the default option. NooJ treats the character “New Line” (also noted “\n” by programmers and “^p” in MS-Word), or the two-character sequence “Carriage Return / New Line” (also noted “\r\n”), as a text unit delimiter.

The “New Line” character is used either as a line break, for instance in poems, or more generally as a paragraph break by word processors, including Windows’ Notepad tool and MS-Word. In the latter case, NooJ will process the text paragraph by paragraph.
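The default behavior can be approximated in Python as follows. This is a rough sketch of the splitting principle, not NooJ’s implementation; the function name is our own:

```python
import re

def split_into_text_units(text: str) -> list[str]:
    # Treat "\r\n", "\n" or "\r" as text-unit (line/paragraph) delimiters,
    # dropping the empty units produced by consecutive line breaks.
    units = re.split(r"\r\n|\n|\r", text)
    return [u for u in units if u.strip()]
```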

(3) PERL regular expression:

This option allows users to define their own text unit delimiter. In the simplest cases, the delimiter could be a constant string, such as “===” in the following example:

This is the

first text unit.

===

This is the second text unit.

===

This is

the third

text unit.

===

This is the fourth text unit.
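With a constant delimiter such as “===”, the splitting could be sketched in Python like this (an illustration of the principle, with our own function name, not NooJ’s actual code):

```python
def split_on_delimiter(text: str, delimiter: str = "===") -> list[str]:
    # A line consisting solely of the delimiter closes the current text unit.
    units, current = [], []
    for line in text.splitlines():
        if line.strip() == delimiter:
            units.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    if current and "".join(current).strip():
        units.append("\n".join(current).strip())
    return units
```

Applied to the example above, this yields four text units, the first spanning two lines and the third spanning three.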

PERL regular expressions allow users to describe more sophisticated patterns, such as the following:

^[0-9][0-9]:[0-9][0-9][ap]m$

This expression recognizes any line that consists of two two-digit numbers separated by a colon, followed by an “a” or a “p”, followed by “m”, as in the following text:

12:34am

This is the first text unit.

07:00pm

This is the second text unit.

PERL regular expressions can contain special characters such as “^” (beginning of line), “$” (end of line), “|” (or), etc. Look at PERL’s documentation for more information.
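Python’s `re` module accepts this PERL-style pattern unchanged, so the delimiter above can be applied as follows. This is a sketch; the choice to pair each delimiter line (here, a timestamp) with the unit that follows it is our own design decision:

```python
import re

# The same delimiter pattern as in the example above.
TU_DELIMITER = re.compile(r"^[0-9][0-9]:[0-9][0-9][ap]m$")

def split_on_pattern(text: str) -> list[tuple[str, str]]:
    # Collect (delimiter line, text unit) pairs.
    units, tag, current = [], None, []
    for line in text.splitlines():
        if TU_DELIMITER.match(line):
            if tag is not None:
                units.append((tag, "\n".join(current).strip()))
            tag, current = line, []
        else:
            current.append(line)
    if tag is not None:
        units.append((tag, "\n".join(current).strip()))
    return units
```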

7.4 Importing XML Documents

The last option allows users to process structured XML documents, more specifically texts that contain XML-type tags. Here is an example of such a text:

<document>

<page>

<s>this is a sentence</s></page>

<page><s>this is another sentence</s>

<s>the last sentence</s>

</page></document>

This text is structured as one document block of data, inside which there are two page blocks of data. The first page contains one s (“sentence”); the second page contains two sentences. Note that each level of the document structure is written as data enclosed between two tags: a beginning tag (<document>, <page> or <s>) and the corresponding ending tag (</document>, </page> or </s>).

Text Nodes

When parsing such structured documents with NooJ, it is important to tell NooJ where to apply its linguistic data and queries, i.e. which nodes of the structured data contain the textual data to be processed.

Typically, a document may contain meta-data such as the author name, the date of publication, and references to other documents, as well as textual information such as an abstract, a text and a conclusion:

<document>

<author>Max Silberztein </author>

<date>August 21, 2006</date>

<abstract>this is an abstract</abstract>

<text>this is a very short introduction</text>

<conclusion>this is the conclusion</conclusion>

</document>

In that case, we would ask NooJ to apply its linguistic data and queries to the blocks of textual data such as <abstract>, <text> and <conclusion>, and simply ignore the other data. In order to specify which blocks of textual data we want NooJ to process, we give NooJ the list of the corresponding tags, e.g.:

<abstract> <text> <conclusion>
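The selection of text nodes can be illustrated with Python’s standard XML parser. This is a sketch of the principle only, not NooJ’s implementation; the function name and the set of tags are taken from the example above:

```python
import xml.etree.ElementTree as ET

# The tags whose contents we want processed; everything else is ignored.
TEXT_NODES = {"abstract", "text", "conclusion"}

def extract_textual_data(xml_source: str) -> list[str]:
    root = ET.fromstring(xml_source)
    # Keep only the contents of the selected text nodes;
    # skip <author>, <date> and other meta-data elements.
    return [el.text for el in root.iter() if el.tag in TEXT_NODES and el.text]
```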

Multilingual texts

NooJ uses this mechanism to process multilingual texts: for instance, a multilingual text might look like:

<document>

<text-en>this is an English sentence</text-en>

<text-fr>ceci est une phrase française</text-fr>

<text-en>another sentence</text-en>

<text-fr>une autre phrase</text-fr>

...

</document>

In that case, we can open the text as a French (“fr”) text and select the corresponding <text-fr> text nodes to parse, and/or open the text as an English (“en”) text and select the <text-en> blocks of text.

Attribute-Value pairs

XML tags can be associated with attribute-value pairs, which NooJ can process. For instance, instead of having two different XML tags <text-fr> and <text-en>, we might use one single XML tag, say <text>, with an attribute lang that has two possible values, lang=fr or lang=en. The corresponding document would look like this:

<document>

<text lang=en>this is an English sentence</text>

<text lang=fr>ceci est une phrase française</text>

<text lang=en>another sentence</text>

<text lang=fr>une autre phrase</text>

...

</document>

In that case, we would ask NooJ to select the text nodes <text lang=en> or <text lang=fr> for linguistic analysis.
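The same selection by attribute value can be sketched with Python’s standard XML parser. Note one caveat: a standard XML parser requires attribute values to be quoted (lang="en"), whereas the display above abbreviates them as lang=en; the function name is our own:

```python
import xml.etree.ElementTree as ET

def select_language_nodes(xml_source: str, lang: str) -> list[str]:
    # Return the contents of every <text lang="..."> node in the given language.
    root = ET.fromstring(xml_source)
    return [el.text for el in root.findall(f'.//text[@lang="{lang}"]') if el.text]
```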

Importing XML information into NooJ’s Text Annotation Structure

When NooJ imports an XML document, it can convert XML tags to NooJ annotations.

By default, XML tags are converted into syntactic/semantic annotations. These annotations are displayed in green in the Text Annotation Structure. The head of the XML tag is converted into a NooJ category; each XML attribute-value pair is converted into a NooJ name=value property; and each bare XML attribute is converted into a NooJ feature. For instance, the following XML text:

<DATE>Monday, June 1st</DATE>

will produce NooJ’s annotation <DATE>, associated with the text “Monday, June 1st”. In the same manner, the XML text:

<NP Hum Nb="plural">Three cute children</NP>

will produce NooJ’s annotation <NP+Hum+Nb=plural>.

All imported XML tags are translated into syntactic/semantic annotations, except the special XML tag <LU> (LU stands for “Linguistic Unit”). NooJ understands LUs as Atomic Linguistic Units; an LU requires a lemma and a category property. For instance, the following XML text:

<LU CAT=DET plural>The</LU>
<LU LEMMA="child" CAT=N plural>children</LU>

will produce the two lexical annotations: <the,DET+plural> and <child,N+plural>.

Notice that the LEMMA is optional: if absent, it is assumed to be identical to the word form.

The ability to import XML tags as NooJ’s lexical or syntactic/semantic annotations allows NooJ to parse texts that have been processed with other applications, including taggers.
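The tag-to-annotation conversion described above can be sketched in Python. Two caveats: a standard XML parser requires every attribute to carry a value, so a bare feature such as Hum must be written Hum="" for this sketch to parse it; and the function only mimics the displayed annotation format, it is not NooJ’s code:

```python
import xml.etree.ElementTree as ET

def to_nooj_annotation(element: ET.Element) -> str:
    # Tag head -> category; name="value" -> +name=value property;
    # attribute with an empty value (a "bare" feature) -> +name feature.
    parts = [element.tag]
    for name, value in element.attrib.items():
        parts.append(name if value == "" else f"{name}={value}")
    return "<" + "+".join(parts) + ">"
```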

7.5 Exporting Annotated Text into XML Documents

Note that NooJ can also export its Text Annotation Structure as an XML document.

Chapter 8. Working With Corpora

A corpus is a set of text files that share the same parameters: usually, the language, the structure and the encoding.

NooJ uses its own file format to process corpora. Basically, “.noc” NooJ corpora files store the texts as well as their structural information (e.g. text units), various indices and linguistic annotations in each Text Annotation Structure.

To open a NooJ corpus file, use the command File > Open > Corpus. To create a new corpus file, use the command File > New Corpus. NooJ then creates an empty “.noc” file in which one can import sets of text files.

When creating a new corpus file, the same three parameters that were used to import texts have to be set: (A) the corpus’ language, (B) the corpus files’ format, and (C) the texts’ delimiters.

Figure 2 Create a Corpus

All of NooJ’s functionalities that can be performed on a single text can also be performed at the corpus level. It is thus possible to compute statistical measures at the character, token, digram, annotation and unknown-word levels; it is possible to perform the linguistic analysis (CORPUS > Linguistic Analysis) at the corpus level, as well as to build concordances at the corpus level.

Note that concordances built from a corpus have an extra column that displays the name of the text from which each occurrence was extracted; moreover, the statistical report (CONCORDANCE > Build Statistical Report) also computes the standard deviation of the number of occurrences across the text files.
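The per-text statistics could be computed as follows. This is a sketch only: the source does not specify whether NooJ uses the sample or the population standard deviation, so we show the sample version, and the function name is our own:

```python
from statistics import mean, stdev

def occurrence_statistics(counts_per_text: dict[str, int]) -> tuple[float, float]:
    # counts_per_text maps each text file's name in the corpus to the number
    # of occurrences of the query it contains; return the mean and (sample)
    # standard deviation of those counts across the corpus.
    values = list(counts_per_text.values())
    return mean(values), stdev(values)
```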

Topic revision: r1 - 2009-08-02 - MaxSilberztein
 