GETTING STARTED

This first section presents NooJ and its applications, takes you through the installation process, and then gives you the minimum amount of information necessary to launch a basic search in a text.

CHAPTER 1. WELCOME

1.1. INTRODUCTION

NooJ is a development environment used to construct large-coverage formalized descriptions of natural languages, and apply them to large corpora, in real time. The descriptions of natural languages are formalized as electronic dictionaries and as grammars represented by organized sets of graphs.

NooJ supplies tools to describe inflectional and derivational morphology, terminological and spelling variations, vocabulary (simple words, multi-word units and frozen expressions), semi-frozen phenomena (local grammars), syntax (grammars for phrases and full sentences) and semantics (named entity recognition, transformational analysis).

NooJ is also used as a corpus processing system: it allows one to process sets of (thousands of) text files. Typical operations include indexing morpho-syntactic patterns, frozen or semi-frozen expressions (e.g. technical expressions), lemmatized concordances and the statistical study of the results.

1.2. System requirements

NooJ is a .NET application. It currently runs under Windows 95-98-ME, Windows NT-2000, Windows XP and Windows VISTA, although some of its functionalities (e.g. UNICODE and XML support) are only available with Windows 2000, Windows XP and Windows VISTA. As for any application, we strongly advise that you update both your operating system and the .NET Framework, by downloading their latest “Service Pack”.

The MONO and DOTGNU projects aim at building a .NET computing environment (i.e. virtual machine) for LINUX, FreeBSD, Mac OSX as well as several variants of UNIX. We have successfully tested noojapply.exe on MONO, but as of now NooJ.exe does not yet run on MONO. When these projects are completed, NooJ will run under these operating systems as well. For more information, see:

http://www.mono-project.com and http://www.dotgnu.org

Minimum requirements for a computer to run NooJ on small texts (less than one megabyte) are modest: 512 MB of RAM and 1 GB available on the hard drive.

If you plan to use NooJ to parse large corpora (hundreds or thousands of text files), or to compile large-coverage dictionaries (tens of thousands of entries or more), the minimum configuration should be higher: PC with Pentium 4 or equivalent, 2 GB RAM or more.

If you are planning to use NooJ to develop large sets of local grammars (hundreds of graphs), a good screen is necessary: at least a 19 inch screen, with a 1600x1024 16-bit resolution, and a minimum of 80 Hz refresh rate.

1.3. NooJ after INTEX

INTEX’s technology was based on my thesis work (1989), and its first version was released in 1992. Between 1992 and 2002, INTEX evolved substantially, mostly “organically”, in response to the needs of its users. Moreover, over the years, INTEX’s technology became obsolete: written in C/C++, it was monolingual, could only handle one single file format and one text file at a time, had no support for XML, etc.

In 2002, I decided to rebuild a “new and better INTEX”: NooJ. For several reasons (INTEX’s technology was obsolete, INTEX’s thousands of C functions were a big mess, and, perhaps most importantly, I was just plainly sick of it!), I decided to write the new application from the ground up, without even looking at INTEX’s source. However,

(1) I have used my 10-year experience as INTEX’s author to redesign the whole system, making sure to redesign every single “good” feature while avoiding the same mistakes, e.g. the times when I had to hack INTEX to fit in a new, important functionality that should not have been a hack in the first place.

(2) I have made sure that NooJ, although different in its software architecture and linguistic methodology, was as compatible as possible with INTEX, so that most linguistic resources could be reused with very little modification. Note that in general, linguists’ work is much easier with NooJ than with INTEX. The largest differences between INTEX and NooJ are always in the direction of simplification: for instance, NooJ has no more DELAF/DELACF dictionaries, NooJ’s grammars are self-contained (instead of being made of dozens of different files), morphological and syntactic grammars share the same linguistic engine, etc.

To program NooJ, I decided to follow a Component-Based Software approach, which is a step beyond the Object-Oriented Programming paradigm. I first tried to use the JAVA/J2EE framework; for a number of technical reasons related to optimization problems with the various JAVA virtual machines that were available in 2002 on UNIX, LINUX and Windows machines, and also, just because of personal taste, I then decided to switch to the C#/.NET framework (which is, in my humble opinion, more elegant and more fun!). The .NET framework gives NooJ a number of great functionalities, including the automatic management of hundreds of text encodings and formats, a native XML compatibility, both for parsing XML documents and to store objects (XML/SOAP); the ASP.NET library allows NooJ to be easily transformed into a WEB server application, .NET Services and Remoting technology allow NooJ’s functionalities to be available as independent agents that run in parallel, etc.

NooJ is based on a brand new linguistic engine, capable of processing multi-lingual texts and corpora, in over 100 file formats.

NooJ’s non-destructive linguistic engine is based on an annotation system: NooJ, as opposed to INTEX, never modifies the texts it is parsing (no more REPLACE or MERGE mode). Therefore, NooJ is an ideal tool to perform a large number of operations in cascade or in parallel. Grammar writers no longer need to enter meta-data in their grammars to take previous analyses into account, and NooJ grammars, as opposed to INTEX’s, should stay small and truly independent from each other (INTEX’s grammars had the tendency to become huge very quickly because there was no simple way to adapt them to each particular need).

NooJ’s dictionaries are a great enhancement over INTEX DELA-type dictionaries as well as lexicon-grammar tables. NooJ’s dictionaries are similar to DELAS-DELAC dictionaries (no more difference between simple and compound words) and can represent spelling and terminological variants as well (no more DELAV dictionaries). NooJ does not need DELAF-DELACF type “inflected dictionaries” because it processes word inflection transparently, including compound word inflection (INTEX did not process compound word inflection). Moreover, NooJ generalizes inflectional morphology to derivational morphology, so that derivations are formalized in a very natural way (there was no provision for derivational morphology in INTEX).

NooJ’s integration of morphology and syntax allows NooJ to perform morphological operations inside syntactic grammars: for instance, it is possible to ask NooJ to locate all verbs conjugated in the present, third person singular, and replace them with the word form “is” followed by the verbs in their Past Participle form. Now we can write and perform automatic transformations on large texts!

Finally, NooJ processes lexicon-grammar tables without meta-graphs (INTEX’s meta-graphs were very cumbersome to use, could not be merged with phrasal syntactic grammars, and more importantly, could not be merged together because of the explosion of the size of their compiled form).

In conclusion, I have used my 10-year experience as the INTEX designer, programmer as well as a simple INTEX user, to rebuild not only a “much better” INTEX, but a complete new platform. I believe that INTEX users should be able to switch easily to NooJ, and I hope that the new engine and functionalities will give other linguists and NLP developers new reasons to discover NooJ!

1.4. Programming NLP applications with NooJ

One can easily build prototypes that contain powerful NooJ functionalities.

In its Standard edition, NooJ’s functions are available via a command-line program: noojapply.exe, which is stored in NooJ’s _App directory alongside Nooj.exe. noojapply.exe can be called either directly from a “SHELL” script, or from more sophisticated programs written in PERL, C++, JAVA, etc.

noojapply.exe allows users to apply dictionaries and grammars to texts and corpora automatically.

Note to INTEX users: noojapply.exe provides the same functionalities as the 30+ programs that constituted the INTEX package, in a much more efficient way. For instance, noojapply processes any number of texts (instead of a single one), it compiles deterministic grammars dynamically, etc.

If you are planning to use NooJ’s functionalities in a professional environment (e.g. build a linguistic research engine), note that they are also available via:

-- a .NET dynamic library, noojengine.dll, constituted by a set of public object classes and methods. These classes and methods can be used by any .NET application, in any .NET programming language. noojengine.dll allows users to build sophisticated applications such as WEB services, and can be used to build much more efficient NLP applications than noojapply.exe.

-- a noojservice.exe / noojclient.exe client-server application, based on a Windows service, that provides the functionalities of NooJ’s morphological and syntactic parsers in a Multi-Agent System, which can be used to build a massively parallel NLP application.

1.5. Text Annotation Structure

NooJ’s linguistic engine uses an annotation system. An annotation is a pair (position, information) that states that a certain sequence of the text has certain properties. When NooJ processes a text, it produces a set of annotations, stored in the Text Annotation Structure (TAS); annotations are always kept synchronized with the original text file, which is never modified. Annotations can be associated with single word forms (e.g. to annotate “table” as a noun), with parts of word forms (e.g. to annotate “not” in “cannot” as an adverb), with multi-word units (e.g. to annotate “round table” as a noun) and with discontinuous expressions (e.g. to annotate “take X into account” in “John took the problem into account”).
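To make the idea of a (position, information) pair concrete, here is a minimal Python sketch of a non-destructive annotation store. It is only an illustration and not NooJ’s internal representation: the Annotation class, the annotate function and the “N” label are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A (position, information) pair; the position is a character span."""
    start: int   # index of the first character of the annotated sequence
    end: int     # index just past its last character
    info: str    # linguistic information (hypothetical label, e.g. "N")


def annotate(text, phrase, info):
    """Return one annotation per occurrence of `phrase`; `text` is never modified."""
    annotations = []
    pos = text.find(phrase)
    while pos != -1:
        annotations.append(Annotation(pos, pos + len(phrase), info))
        pos = text.find(phrase, pos + 1)
    return annotations


text = "They sat at the round table."
# Both analyses coexist: "table" alone and the multi-word unit "round table".
tas = annotate(text, "table", "N") + annotate(text, "round table", "N")
for a in tas:
    print(a.start, a.end, repr(text[a.start:a.end]), a.info)
```

Because the annotations live next to the text rather than inside it, competing analyses of the same span can be stored together and filtered later. A discontinuous expression such as “take X into account” would need a list of spans instead of a single one, which this sketch does not cover.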

NooJ morphological and syntactic parsers provide tools to add annotations to a TAS, to remove (filter out) annotations from a TAS, to export annotated texts and corpora as XML documents, as well as to parse XML documents and import certain of their XML tags into NooJ’s own TAS.

1.6. Computational devices

NooJ’s linguistic engine includes several computational devices used both to formalize linguistic phenomena and to parse texts.

Finite-State Transducers

A finite-state transducer (FST) is a graph that represents a set of text sequences and then associates each recognized sequence with some analysis result. The text sequences are described in the input part of the FST; the corresponding results are described in the output part of the FST.

Typically, a syntactic FST represents word sequences, and then produces linguistic information (such as its phrasal structure). A morphological FST represents sequences of letters that spell a word form, and then produces lexical information (such as a part of speech, a set of morphological, syntactic and semantic codes).

Finite-State Automata (FSA)

In NooJ, Finite-State Automata are a special case of finite-state transducers that do not produce any result (i.e. they have no output). NooJ’s users typically use FSA to locate morpho-syntactic patterns in corpora, and extract the matching sequences to build indices, concordances, etc.
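The difference between the two devices can be sketched in a few lines of Python. This is a toy deterministic transducer over words, written for illustration only: the transition table, the state numbers and the output labels (“DET”, “ADJ”, “N”) are assumptions, and NooJ’s own graphs are not stored this way.

```python
# state -> list of (input word, output label, next state); all values are illustrative
TRANSITIONS = {
    0: [("the", "DET", 1), ("a", "DET", 1)],
    1: [("round", "ADJ", 1), ("table", "N", 2)],
}
FINAL_STATES = {2}


def transduce(words):
    """FST behaviour: return the outputs for a recognized sequence, else None."""
    state, outputs = 0, []
    for word in words:
        for symbol, output, next_state in TRANSITIONS.get(state, []):
            if symbol == word:
                outputs.append(output)
                state = next_state
                break
        else:
            return None  # no transition reads this word: sequence not recognized
    return outputs if state in FINAL_STATES else None


def recognize(words):
    """FSA behaviour: same input part, but only a yes/no answer, no output."""
    return transduce(words) is not None


print(transduce(["the", "round", "table"]))
print(recognize(["the", "round"]))
```

The input part of the transducer is what recognize uses; the output part is what transduce adds on top of it.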

Recursive Transition Networks (RTNs)

Recursive Transition Networks are grammars that contain more than one graph; graphs can be FST or FSA, and also include references to other, embedded graphs; these latter graphs may in turn contain other references, to the same, or to other graphs. Generally, RTNs are used in NooJ to build libraries of graphs from the bottom-up: simple graphs are designed; then, they are re-used in more general graphs; these ones in turn are re-used, etc.
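The bottom-up reuse of graphs can be illustrated with a small recursive matcher in Python. Everything here is an assumption made for the example: the grammar (a “Date” graph that reuses “Day” and “Month” graphs), the “:” prefix used to mark a reference to an embedded graph, and the representation of a graph as a list of alternative paths.

```python
# Each graph is a list of alternative paths; an item starting with ":" is a
# reference to another graph (hypothetical notation for this sketch).
GRAPHS = {
    "Date": [[":Day", ":Month"], [":Month", ":Day"]],
    "Day": [["1"], ["2"], ["3"]],
    "Month": [["January"], ["February"]],
}


def match(graph, tokens, pos=0):
    """Return the set of positions reachable after matching `graph` from `pos`."""
    ends = set()
    for path in GRAPHS[graph]:
        positions = {pos}
        for item in path:
            reached = set()
            for p in positions:
                if item.startswith(":"):
                    # embedded graph: recurse, then continue from where it ended
                    reached |= match(item[1:], tokens, p)
                elif p < len(tokens) and tokens[p] == item:
                    reached.add(p + 1)
            positions = reached
        ends |= positions
    return ends


tokens = ["3", "January"]
print(len(tokens) in match("Date", tokens))
```

A sequence is fully recognized when the length of the token list is among the returned positions. Note that this naive matcher would loop forever on a graph that refers to itself as its first item.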

Enhanced Recursive Transition Networks (ERTNs)

Enhanced Recursive Transition Networks are RTNs that contain variables; these variables typically store parts of the matching sequences, and then are used to perform some operation with them (e.g. put their content in the plural, etc.), and then produce the resulting output.

Because variables can be duplicated, inserted and/or displaced in the output, ERTNs give NooJ the power of performing linguistic transformations on texts. Examples of transformations include negation, passivization, nominalization, etc.
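As a rough illustration of how variables enable transformations, the Python sketch below stores three parts of a matching sequence in named variables, applies a morphological operation to one of them, and then displaces them in the output to produce a passive sentence. The pattern, the variable names N0, V and N1, and the tiny past-participle table are all assumptions made for this example; a real grammar would rely on NooJ’s dictionaries and morphology rather than on a hand-written table.

```python
import re

# Hypothetical mini-lexicon: present 3rd person singular -> past participle
PAST_PARTICIPLE = {"sees": "seen", "eats": "eaten", "writes": "written"}

# Three variables, filled by the three words of the matching sequence
PATTERN = re.compile(r"^(?P<N0>\w+) (?P<V>\w+) (?P<N1>\w+)$")


def passivize(sentence):
    """Return the passive form of a 'N0 V N1' sentence, or None if no match."""
    m = PATTERN.match(sentence)
    if m is None or m.group("V") not in PAST_PARTICIPLE:
        return None
    participle = PAST_PARTICIPLE[m.group("V")]   # operation on a variable
    # the variables are displaced in the output: N1 comes first, N0 last
    return f"{m.group('N1')} is {participle} by {m.group('N0')}"


print(passivize("John sees Mary"))
```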

Regular Expressions

Regular Expressions also constitute a quick way to enter simple queries without having to construct grammars. When the sequence to be located consists of a few words, it is much quicker to enter these words directly into a regular expression. However, as the query becomes more and more complex, as is usually the case in Linguistics, one should build a grammar.

Context-Free Grammars (CFGs)

In NooJ, CFGs constitute an alternative means to enter morphological or syntactic grammars.

For instance, NooJ includes an inflectional/derivational module that is associated with its dictionaries, so that it can automatically link dictionary entries with their corresponding forms that occur in corpora (this functionality allows NooJ to get rid of INTEX’s full form dictionaries such as DELAF and DELACFs).

NooJ dictionaries generally associate each lexical entry with an inflectional and/or derivational paradigm. For instance, all the verbs that conjugate like “aimer” are linked to the paradigm “+FLX=AIMER”; all the verbs that accept the “-able” suffix are linked to the paradigm “+DRV=ABLE”, etc.

Paradigms such as “AIMER” or “ABLE” are described either graphically in RTNs or by CFGs in text files.
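The link between a dictionary entry, its paradigm and the forms found in texts can be sketched as follows in Python. This is an assumption-laden illustration, not NooJ’s rule syntax: the PARADIGMS table (here limited to the present indicative of verbs like “aimer”), the two sample entries and the function names are invented for the example.

```python
# Hypothetical paradigm table: suffix to strip from the lemma, then endings to add
PARADIGMS = {
    "AIMER": ("er", ["e", "es", "e", "ons", "ez", "ent"]),  # present indicative only
}

# Sample entries written in the "lemma,category+FLX=PARADIGM" style used above
ENTRIES = ["aimer,V+FLX=AIMER", "parler,V+FLX=AIMER"]


def inflect(entry):
    """Return (lemma, sorted list of distinct inflected forms) for one entry."""
    lemma, rest = entry.split(",", 1)
    paradigm = rest.split("+FLX=", 1)[1].split("+", 1)[0]
    suffix, endings = PARADIGMS[paradigm]
    stem = lemma[: -len(suffix)]
    return lemma, sorted({stem + ending for ending in endings})


def build_index(entries):
    """Map each inflected form back to the lemma(s) that can produce it."""
    index = {}
    for entry in entries:
        lemma, forms = inflect(entry)
        for form in forms:
            index.setdefault(form, set()).add(lemma)
    return index


index = build_index(ENTRIES)
print(inflect(ENTRIES[0]))
print(index["parlons"])
```

The point of the sketch is that the inflected forms are derived from the entry and its paradigm on demand, so no separate full-form dictionary has to be maintained by hand.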

1.7. Linguistic Resources

With NooJ, linguists build, test and maintain two basic types of linguistic resources:

-- Dictionaries (.dic files) usually associate words or expressions with a set of information, such as a category (e.g. “Verb”), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to nominalize them), one or more syntactic properties (e.g. “+transitive” or “+N0VN1PREPN2”), one or more semantic properties (e.g. distributional classes such as “+Human”, domain classes such as “+Politics”). Lexical properties can be binary, such as “+plural”, or can be expressed as an attribute-value pair, such as “+number=plural”. Values can belong to the meta-language, such as in “+number=plural”, to the input language such as in “+synonym=pencil” or to another language, such as in “+FR=crayon”.

NooJ’s dictionaries constitute a converged and enhanced version of the DELA-type dictionaries that were used in INTEX: a NooJ dictionary can include simple words (like a DELAS), multi-word units (like a DELAC) and can link lexical entries to a canonical form (like a DELAV). Contrary to INTEX, NooJ does not need full inflected form dictionaries (no more DELAF or DELACF).

NooJ’s ability to type pieces of information (e.g. “masculine” is a value of the “gender” property) allows it to process lexicon-grammar tables as well. Indeed, NooJ can display any dictionary in a “list” form or in a “table” form.

-- Grammars are used to represent a large gamut of linguistic phenomena, from the orthographical and the morphological levels, up to the syntagmatic and transformational syntactic levels.

NooJ has three types of grammars:

(a) Inflectional and derivational grammars (.nof files) are used to represent the inflection (e.g. conjugation) or the derivation (e.g. nominalization) properties of lexical entries. These descriptions can be entered either graphically or in the form of rules;

(b) Lexical, orthographical, morphological or terminological grammars (.nom files) are used to represent sets of word forms, and associate them with lexical information, e.g. to standardize the spelling of word or term variants, to recognize and tag neologisms, to link synonymous expressions together;

(c) Syntactic or semantic grammars (.nog files) are used to recognize and annotate expressions in texts, e.g. to tag noun phrases, certain syntactic constructs or idiomatic expressions, to extract certain expressions of interest (names of companies, expressions of dates, addresses, etc.), or to disambiguate words by filtering out some lexical or syntactic annotations in the text.
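The distinction between binary properties and attribute-value pairs in dictionary entries can be illustrated with a short Python sketch. The two sample entries are hypothetical (they only recombine property names quoted above), and the parser is a simplification: it assumes that neither the lemma nor the values contain a “,” or a “+”.

```python
def parse_entry(line):
    """Split 'lemma,category+prop+attr=value' into its three components."""
    head, *properties = line.split("+")
    lemma, category = head.split(",", 1)
    parsed = {}
    for prop in properties:
        name, separator, value = prop.partition("=")
        # binary property ("+Human") -> True; attribute-value pair -> its value
        parsed[name] = value if separator else True
    return lemma, category, parsed


# Hypothetical entries, for illustration only
print(parse_entry("teacher,N+Human"))
print(parse_entry("pen,N+synonym=pencil+FR=crayon"))
```

Keeping the attribute name next to its value is what makes a “table” view possible: each attribute becomes a column and each entry a row.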

1.8. NooJ’s Community

NooJ can be freely downloaded from http://www.nooj4nlp.net. Most laboratories and academic centers use NooJ as a research or educational tool: some users are interested in its corpus processing functionalities (analyzing literary texts, searching for and extracting information from newspapers or technical corpora, etc.); others use NooJ to formalize certain linguistic phenomena (e.g. describe a language’s morphology), others for computational applications (automatic text analysis), etc.

Visit NooJ’s WEB site at http://www.nooj4nlp.net, and NooJ’s forum at http://groups.yahoo.com/group/nooj-info to learn more about NooJ, its applications and its users.

Among NooJ users, some are actively helping the NooJ project, by giving away some of their linguistic resources, projects or demos, labs, tutorials or documentation. These users, who constitute “NooJ’s community”, should be considered as NooJ’s “co-authors”. The Community Edition of the NooJ application (which is also free) is an extended version of NooJ that gives full access to its internal functionalities as well as privileged access to the sources of its linguistic resources.

NooJ users meet once a year at the NooJ conference. NooJ tutorials and workshops are regularly organized during the year.

1.9. Structure of the book

This book is divided into six parts:

This section “Getting Started” presents NooJ (Chapter 1.), takes you through the installation process (Chapter 2.), and then helps you launch a basic search in a text (Chapter 3.);

The section “Regular expressions and graphs” shows you how to carry out simple searches in texts with regular expressions (Chapter 4.), how to use lexical resources for linguistic requests (Chapter 5. ), and how to use NooJ’s graph editor to describe more complex queries (Chapter 6.);

The section “Text Processing” explains how to import and process texts (Chapter 7.), and how to construct and process corpora (7.5. ). NooJ can import texts in over 100 file formats (all variants of ASCII, EBCDIC, ISO, Unicode, etc.), and documents in a dozen formats (all variants of MS-WORD, HTML, RTF, etc.). NooJ can also process XML documents (see Chapter 8.). In the latter case, XML tags can be imported into the Text’s Annotation Structure, and a NooJ annotated text can be exported back to an XML document.

The section “Lexical analysis” describes NooJ dictionaries (Chapter 9. ), NooJ’s inflectional and derivational tools (Chapter 10.), and productive morphological grammars (Chapter 11. ). NooJ uses these three tools to perform an automatic Lexical Parsing of texts (Chapter 12. ).

The section “Syntactic Analysis” presents local grammars, and how to build libraries of graphs in order to build a bottom-up description of Natural languages using local grammars and remove lexical ambiguities (Chapter 13.). NooJ’s parser’s behavior is complex; we explain how it processes ambiguous and empty sequences, and how to set its priorities (Chapter 14. ). Finally, we present more powerful grammars, such as enhanced RTNs used to perform automatic transformational analyses and translations (Chapter 16. ).

The section “References” presents a number of tools aimed at teaching Linguistics, Corpus Linguistics and Computational Linguistics (Chapter 17.). We then describe every menu item and functionality (Chapter 18. ). Finally, we present NooJ’s standalone command-line program noojapply.exe, which can be used directly from a command-line “DOS” window or a UNIX Shell environment. Chapter 20. is the bibliography.

CHAPTER 2. INSTALLING THE SOFTWARE

2.1. Installing NooJ

You can freely download NooJ from the NooJ Web site: http://www.nooj4nlp.net.

NooJ’s installation is straightforward and is based on the XCOPY model: simply copy NooJ’s application folder onto your computer. Neither a “SETUP” program nor any modification of the Windows registry is necessary! Moreover, you do not need “Administrative” rights on a computer to install NooJ: you just need to be able to copy it anywhere on the computer.

Go to NooJ’s Web site’s “Download” page, then download the file “NooJ2.zip”. Uncompress it, for instance into your desktop, or into the usual application folder “c:\Program files”, or into any other folder you wish. In the resulting “NooJ” folder, there is a “_App” folder; in this latter folder, locate the application file “NooJ.exe”. You might want to make a shortcut to this file, and store the shortcut either in Windows’ “Start” menu, or on the desktop.

IMPORTANT: NooJ is based on the .NET Framework technology. Before proceeding any further, make sure that .NET Framework 2.0 (or above) is already installed on your PC.

2.2. NooJ’s file extensions

If you wish, you can associate NooJ’s file types with the NooJ application, so that double-clicking these files will launch NooJ and open them automatically. The following file extensions can be associated with NooJ:

.DIC (dictionary)

.NOC (corpus)

.NOF (inflectional/derivational morphological grammar)

.NOG (syntactic grammar)

.NOM (productive morphological grammar)

.NOP (project)

.NOT (text)

2.3. Installing new modules for NooJ

The “Nooj2.zip” package includes two language modules: the English and French standard modules. Members of NooJ’s community have been posting other modules for NooJ, including modules for Arabic, Western Armenian, Chinese, Hebrew, Italian, Latin, Spanish, etc. See http://www.nooj4nlp.net (then go to the Community page) for the latest resources.

To add a module for NooJ, simply download the corresponding .zip file from the Community page then extract its content into the NooJ folder, so that the new folder is at the same level as the standard “en” and “fr” folders. The new folder should contain the three sub-folders “Lexical Analysis”, “Projects” and “Syntactic Analysis”.

2.4. Registering NooJ’s Community Edition

NooJ’s standard edition does not require any registration and can be used freely. NooJ’s community edition is mainly used by researchers of the NooJ Community, i.e. people who actively help NooJ’s project and community. As NooJ’s project is very ambitious (formalize natural languages from the orthographic level up to semantics), there are many ways to help us! If you do wish to use the Community Edition, you will need to register. Contact NooJ’s author:

max.silberztein@univ-fcomte.fr

for more information about the Community edition.

To run NooJ in the Community mode, go to the “Info” menu, click “About NooJ”, select the “Community” option then enter your information (contact, institution, license key).

2.5. Personal folder

NooJ software is stored in NooJ’s application folder. NooJ’s personal folder is the default folder in which NooJ stores all your personal data. In general, Windows sets your NooJ personal folder to be located in your “My documents” folder:

My documents\NooJ

Your personal Windows settings may vary. For each language you are working with, NooJ creates one sub-folder (e.g. “en” for English, “fr” for French, etc.):

My documents\NooJ\en

My documents\NooJ\fr

in which it stores the corresponding linguistic resources. Each language folder in turn contains three embedded sub-folders: one to store lexical resources, one to store syntactic and semantic resources, and one to store corpora, projects and texts, e.g.:

My documents\NooJ\en\Lexical Analysis

My documents\NooJ\en\Projects

My documents\NooJ\en\Syntactic Analysis
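The folder layout described above can be summarized with a short Python sketch that builds the three sub-folder paths for a given language. The function name and the root folder passed to it are assumptions for this example; as noted, the actual location of the personal folder depends on your Windows settings.

```python
from pathlib import PureWindowsPath

# The three sub-folders that each language folder contains
SUBFOLDERS = ("Lexical Analysis", "Projects", "Syntactic Analysis")


def language_folders(personal_root, language):
    """Return the three resource folders for `language` under `personal_root`."""
    base = PureWindowsPath(personal_root) / "NooJ" / language
    return [base / name for name in SUBFOLDERS]


for folder in language_folders("My documents", "fr"):
    print(folder)
```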

2.6. Preferences

NooJ’s behaviour is based on a number of default parameters that are set via the Info > Preferences control panel. There, you can set your default working language, default fonts to display texts and dictionaries, what lexical and syntactic resources are applied for each language, etc.

2.7. Updates

NooJ’s computational functionalities and linguistic resources are updated regularly. To upgrade the software, just replace the NooJ application folder (the one you may have stored on your desktop or in C:\Program files), with the latest version available at the NooJ web site:

http://www.nooj4nlp.net

Your own data should be stored in your personal folder, by default: “My Documents\Nooj”. NEVER, EVER store any of your data in the NooJ application folder, as it might be lost at the next update.

The same applies to the files supplied with NooJ, whose names start with the prefix “_”: you may modify them, but if you do so, make sure to rename the files when you save them (File > Save As) so that they do not get destroyed at the next upgrade.

2.8. The nooj-info forum

Check out the NooJ forum regularly at:

http://groups.yahoo.com/group/nooj-info

to be kept informed about major updates, availability of new modules, as well as FAQs, tips, etc.

If you subscribe to this group, make sure to set the list to automatically send any messages to your regular email address rather than the new Yahoo email address that Yahoo gives you when you register.

NooJ’s Web site contains a History page, available from the DOWNLOAD page, in which important new functionalities are regularly described.

2.9. Uninstalling NooJ

If you wish to uninstall the software, simply delete the NooJ application folder (i.e. the one you created when you installed the software).

Note that your data is stored in your user folder (not in the NooJ system folder), therefore it will not be deleted. If you want to delete all NooJ’s linguistic resources as well, you can delete the user folder (usually in My Documents\NooJ).

CHAPTER 3. QUICK START

First we learn how to use one of NooJ’s most basic functions: the ability to locate words and expressions in a text.

3.1. Loading a text

If you have not yet done so during the installation process, make sure to create a shortcut, on your desktop or in your Start menu, to the file “Nooj.exe” located in NooJ’s “_App” folder, e.g.:

c:\Program files\NooJ\_App\Nooj.exe

Launch NooJ. Now click the menu items: File > Open > Text. You should see a few text files, with the extension “.NOT” (for “NooJ Text”). Select the file “_The Portrait Of A Lady.not” (the novel by Henry James). The text will load and you should see a window like the one below:

Figure 1. Loading the text “The Portrait of a lady”

At this point, default linguistic resources have already been applied to the text. NooJ produces some indications, displayed above and to the right of the text window:

Language is “English (United States)(en)”.

Text Delimiter is: “\n” (NEW LINE)

Text contains 4646 Text Units (TUs).

285993 tokens including:

233102 word forms

1249 digits

52765 delimiters

Text contains 1013374 annotations.

First of all, a few definitions:

Letters are the elements of the alphabet of the current language. Digits are the ten digit characters (from “0” to “9”). The Blank in NooJ represents any sequence of spaces, tabulation characters, NEWLINE and CARRIAGE RETURN. Delimiters are all the other characters.

Based on these, NooJ uses the following definitions:

Tokens are the basic linguistic objects processed by NooJ. They are classified into three types: Word Forms are sequences of letters between two delimiters; Digits; and Delimiters. Digrams are pairs of word forms (we ignore the delimiters between them).

Note that NooJ processes digits and delimiters both as characters and as tokens.

When processing certain Asian languages, NooJ processes individual letters (rather than sequences of letters), as tokens.

Some unusual examples:

For NooJ, the sequence “o’clock” consists of three tokens: the word form “o”, followed by the delimiter “’”, followed by the word form “clock”. Similarly, the adverb “a priori” is made up of two tokens (blanks do not count).

The sequence “3.14” is made up of one digit, one delimiter, and then two digits, which makes four tokens. The sequence “PDP/11” is made up of the word form “PDP”, followed by the delimiter “/”, followed by two digits (which makes four tokens).
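The definitions and examples above can be checked with a small Python sketch of the tokenization rules. It is an approximation written for this section, not NooJ’s tokenizer: the function name and the “WORD”, “DIGIT” and “DELIM” labels are invented here, and Python’s str.isalpha() stands in for “the alphabet of the current language”.

```python
def tokenize(text):
    """Split text into (kind, value) tokens: word forms, digits and delimiters."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "0123456789":
            tokens.append(("DIGIT", ch))      # each digit is a token of its own
            i += 1
        elif ch.isalpha():
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            tokens.append(("WORD", text[i:j]))  # maximal sequence of letters
            i = j
        elif ch.isspace():
            i += 1                              # blanks are not tokens
        else:
            tokens.append(("DELIM", ch))        # any other character
            i += 1
    return tokens


for sample in ["o'clock", "a priori", "3.14", "PDP/11"]:
    print(sample, len(tokenize(sample)), tokenize(sample))
```

Run on the four examples, the sketch yields three, two, four and four tokens respectively, in line with the counts given above.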

Our text has 233,102 word forms, 1,249 digits (that low number is characteristic of literary texts) and 52,765 delimiters (i.e. roughly one punctuation character every 5 word forms).

Double click in the Results window (the little area above the text), to display the lists of the text’s Characters, Tokens and Digrams:

Figure 2. Several results: the text’s characters, tokens and digrams

The most frequent characters are the space character (it occurs 219,484 times in this text) and the letter “e” (it occurs 124,550 times).

The most frequent tokens in this text are the word form “the” (7,575 times), and “to” (7,295 times).

The most frequent digrams are the sequence “of the” (981 times) and “don t” (734 times). The digram “don t” corresponds to the sequence with the apostrophe “don’t”: NooJ does not compute the two digrams (don, ’) and (’,t). Hapaxes, i.e. digrams that only occur once, are not displayed.

Digrams are sequences of word forms, i.e. delimiters are simply ignored in digrams.

The three lists (characters, tokens and digrams) can be sorted alphabetically, from left to right, or from right to left, or according to the frequency of each item.
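The way digrams are counted can be sketched in Python as follows. This is an illustration under stated assumptions: the function name is invented, word forms are approximated by maximal runs of Unicode letters, and the sketch counts pairs across the whole string without taking text unit boundaries into account, which may differ from what NooJ does.

```python
import re
from collections import Counter


def digrams(text):
    """Count pairs of consecutive word forms, ignoring delimiters and hapaxes."""
    # maximal runs of letters: digits, blanks and delimiters are skipped
    words = re.findall(r"[^\W\d_]+", text)
    counts = Counter(zip(words, words[1:]))
    # hapaxes (digrams that occur only once) are not kept
    return {pair: n for pair, n in counts.items() if n > 1}


sample = "I don't know. You don't know of the place."
print(digrams(sample))
```

In the sample, “don’t” contributes the digram (don, t) twice, just as in the list displayed by NooJ, while pairs seen only once, such as (of, the), are dropped.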

Based on the previous definitions of characters and tokens, NooJ defines Atomic Linguistic Units (ALUs).

NooJ processes four types of Atomic Linguistic units:
-- affixes (prefix, proper affix or suffix) are the smallest sequences of letters included in word forms that must be associated with relevant linguistic data, e.g. re-, -ization. In NooJ, they are described by inflectional rules, derivational rules or (productive) morphological grammars;
-- simple words are word forms that are associated with relevant linguistic information, e.g. table. They are usually described in dictionaries;
-- multi-word units are sequences of tokens (word forms, blanks, delimiters and/or digits) associated with relevant linguistic information, e.g. as a matter of fact. They are usually described in dictionaries;
-- expressions are potentially discontinuous sequences of word forms that are associated with relevant linguistic information, e.g. take ... into account. They are described either in dictionaries or in syntactic grammars.

Do not confuse the two terms “word forms” (a type of token) and “simple words” (a type of atomic linguistic unit represented in a NooJ linguistic resource).

For instance, the two word forms “THE” and “the” are different tokens. They are usually associated with a unique Atomic Linguistic Unit (simple word) because there is only one lexical entry “the = determiner” in NooJ’s dictionaries that matches both word forms.

However, the single French word form “PIERRE” might correspond to two different ALUs (simple words) because NooJ can link the token to two dictionary entries: “pierre” (noun meaning “stone”) and “Pierre” (a French first name).

Note that the word form “pierre” will be linked only to the dictionary entry “pierre,N”, not to the first name, whose initial letter must be written in uppercase. See NooJ’s case conventions in the dictionary section.
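The three examples above are consistent with a simple rule, sketched below in Python: a lowercase letter in a dictionary entry matches either case in the text, whereas an uppercase letter in the entry must also be uppercase in the text. This rule is only inferred from the examples, and the function name is invented; the dictionary section remains the reference for NooJ’s actual case conventions.

```python
def entry_matches(entry, word_form):
    """Return True if the dictionary entry can match the word form (inferred rule)."""
    if len(entry) != len(word_form):
        return False
    for e, w in zip(entry, word_form):
        if e.islower():
            if e != w.lower():   # lowercase in the entry: either case in the text
                return False
        elif e != w:             # uppercase (or other) in the entry: exact match
            return False
    return True


for form in ["THE", "the", "PIERRE", "pierre"]:
    matching = [e for e in ["the", "pierre", "Pierre"] if entry_matches(e, form)]
    print(form, matching)
```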

NooJ has inserted annotations in this Text’s Annotation Structure:

Text contains 809111 annotations.

An annotation is a pair (position, information) that states that a certain sequence in the text (located at a certain position) is associated with some information. Annotations can be added to the Text Annotation Structure by three mechanisms:

-- NooJ’s lexical parser adds four types of annotations to the text, corresponding to the four types of Atomic Linguistic Units (affixes, simple words, multi-word units and expressions).

-- NooJ’s syntactic parser also can add annotations to, or remove annotations from, the Text Annotation Structure.

-- NooJ can process XML documents, in which case certain XML tags can be imported as annotations into the Text Annotation Structure.

Note that at this stage, NooJ’s lexical parser has produced a high level of ambiguity (233,000 word forms produce 809,000 annotations), typical for English texts. We will need a good syntactic component to lower the level of ambiguity.

Let’s look at the results of the lexical analysis: above the text window, double-click the “Annotations” results (to display the information that has been associated with the text), and then the “Unknowns” results (to display the word forms that have not been associated with any annotation):

Figure 3. Lexemes and Unknowns

NooJ displays these two lists in the native NooJ dictionary format. These two windows can be edited, typically, in order to replace the code “UNKNOWN” with something more useful.

We will see the meaning of these codes later.

3.2. Locating a word form

In the TEXT menu, click “Locate”. The “Locate Panel” window will show up. In the field “Pattern is:”, select the option “a NooJ regular expression:” (you will enter a regular expression), then type “perhaps” in the field (A). Then click a colored button in the lower right corner (B) of the window, for instance the red one. The search operation is launched.

Figure 4. Locate a word

NooJ lets you know that it found 100 matches for your query, and then displays a concordance in the selected color.

Figure 5. Concordance of the word “perhaps”

Double-clicking one entry of the concordance makes NooJ display the corresponding matching occurrence within the text.

The concordance of a sequence is an index that presents all of its occurrences in context. NooJ concordances are displayed in four columns: each occurrence is presented in the middle column, between its left and right contexts. If a corpus (i.e. a set of text files, rather than a single text) is being indexed, the first column displays the name of the text file in which each match occurs.

You can vary the size of the left and right context, as well as the order in which the concordance is sorted.

The cursor (generally an arrow) becomes a hand when it hovers above the concordance; if you click on a match and the text window is open, NooJ displays the matching occurrence within the text. Note that clicking the header of the “Before” context makes NooJ sort the concordance from the end of the preceding word forms.
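To clarify what a concordance contains, here is a minimal Python sketch that builds the “Before”, match and “After” columns for a word, and sorts the rows from the end of the left context. It is not how NooJ computes concordances: the function names and the width parameter are assumptions, and the search is a plain case-sensitive whole-word match rather than a NooJ regular expression.

```python
import re


def concordance(text, word, width=30):
    """Return one (before, match, after) row per occurrence of `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    rows = []
    for m in pattern.finditer(text):
        before = text[max(0, m.start() - width):m.start()]
        after = text[m.end():m.end() + width]
        rows.append((before, m.group(), after))
    return rows


def sort_from_end_of_before(rows):
    """Sort rows by the words of the left context, read from right to left."""
    return sorted(rows, key=lambda row: row[0].split()[::-1])


text = "She was perhaps tired. He had perhaps forgotten."
rows = concordance(text, "perhaps", width=10)
for before, match, after in sort_from_end_of_before(rows):
    print(f"{before:>10} | {match} | {after}")
```

Sorting on the reversed left context groups together the occurrences that share the same preceding word, which is often what one wants when studying what comes just before a match.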

Topic revision: r4 - 2009-06-27 - MaxSilberztein