The first level of text analysis requires the computer to identify the Atomic Linguistic Units (ALUs) of the text, i.e. its smallest non-analysable elements. In Chapter 9, we define these ALUs and show how NooJ’s dictionaries are used to describe them. In Chapter 10, we describe NooJ’s inflectional and derivational morphological module; in Chapter 11, we present NooJ’s lexical and morphological grammars. In Chapter 12, we present NooJ’s lexical parser, which uses dictionaries, the inflectional and derivational engine, and morphological grammars to annotate words in texts.

Chapter 9. NooJ Dictionaries

9.1 Atomic Linguistic Units (ALUs)

Atomic Linguistic Units (ALUs) are the smallest elements that make up the sentence, i.e. the non-analyzable units of the language. They are “words” whose meaning cannot be computed or predicted: one must learn them to be able to use them. For instance, one needs to learn how to say “chair” or “French fries” in another language: there is no linguistic property of these words that could be used to predict or compute their correct translations. Moreover, even for the purpose of analysing only English texts, one has to describe explicitly all the syntactic and semantic properties of these words, e.g. Concrete, Vegetable, Food, etc., because none of these properties can be deduced from their spelling or their components. Therefore, it is crucial to describe these ALUs explicitly.

It is the role of NooJ’s lexical analysis module to represent ALUs. From a formal point of view, NooJ separates the ALUs into four classes:

Simple Words are ALUs that are spelled as word forms. For example, apple, table.

Affixes are ALUs that are spelled as sequences of letters, usually inside word forms. For example: dis- and -ization in the word form “disorganization”.

Multi-word Units are ALUs that are spelled as sequences of word forms. For example: round table (when that means “meeting”), nuclear plant (when that means “Electricity producing industrial plant”).

Frozen Expressions are ALUs that are spelled as potentially discontinuous sequences of word forms. For example: take ... into account.

The terminology used here is natural for English and Romance languages such as French, Italian or Spanish; in the Natural Language Processing community, researchers use the terms “compound words” and “multi-word units” interchangeably. However, the term “compound word” is less well suited for Germanic languages, where we are accustomed to using it to refer to an analyzable sequence of letters (in NooJ, multi-word units or compound words are non-analyzable sequences of word forms). This terminology is irrelevant for Asian languages such as Chinese, where there are no formal differences between affixes, simple words and multi-word units, because there is no blank between word forms.

What is important to consider, however, is that no matter what the language, ALUs can always be grouped into four classes or fewer, each of which corresponds to an automatic recognition program within NooJ’s lexical parser. To identify the first and third types of ALUs (simple words and multi-word units), NooJ looks up its dictionaries. The second type of ALUs (affixes) is identified by breaking word forms into smaller units. NooJ recognizes the fourth type (frozen expressions) by applying syntactic grammars.

Caution: do not confuse the terms simple word (which is a type of ALU) and word form (which is a type of token). A word form is a sequence of letters that does not necessarily correspond to an Atomic Linguistic Unit: for example, “excthz” is a word form, but not an English word; “redismountability” is a word form, but not an ALU because it can be reduced into a sequence of ALUs: “re”, “dis”, “mount”, “able”, “ility”. A simple word is, by definition, an ALU, i.e. the smallest non-analysable element of the vocabulary.

The recognition of an ALU by NooJ’s lexical parser does not imply that the ALU truly occurs in the text. For example, the fact that we find the sequence of two word forms “round table” in NooJ’s dictionary does not necessarily mean that the multi-word unit meaning “a meeting” is indeed present in the text. The same holds for morphological analysis and for lookups in the simple-word dictionaries: the simple form retreat would be considered ambiguous (Napoleon’s retreat vs John retreats the problem).

In NooJ, click Info > Preferences, select the English language “en”, then click the tab Lexical Analysis. The page that is displayed contains two areas: Dictionary and Morphology. The Dictionary zone contains all the lexical resources that are used to recognize simple words and multi-word units; the Morphology zone displays all the morphological grammars that are used to recognize word forms from their components (prefixes, affixes and suffixes).

Figure 1. Resources used to recognize atomic linguistic units

NooJ dictionaries are used to represent, describe and recognize simple words and multi-word units. Dictionaries are “.nod” files that are compiled from editable “.dic” source files.

Technically, “.nod” files are finite-state transducers; we speak of them as compiled dictionaries because, from the point of view of the end user, they come out of editable text “.dic” type documents that were compiled using the Lab > Dictionary > Compile command.

NooJ offers two equivalent tools to represent and describe inflectional and derivational morphology:

Inflectional / derivational descriptions are organized sets of Context-Free rules that describe morphological paradigms.

Inflectional / Derivational grammars are structured sets of graphs that describe morphological paradigms.

Both sets of rules are stored in NooJ inflectional/derivational grammars, in “.nof” files. These descriptions are lexicalized, i.e. each lexical entry of a NooJ dictionary is associated with one or more inflectional and derivational paradigms.

Finally, NooJ also offers morphological grammars (“.nom” files) to recognize certain word forms for which an inflectional or derivational description is not adequate. For instance, in Arabic or in Hebrew, the preposition and the determiner are concatenated to the noun: we don’t want to describe these complex word forms as inflected variants of the noun, but rather as three ALUs. In the same manner, Germanic languages have free noun phrases that are spelled as word forms; NooJ needs to segment these noun phrases into sequences of ALUs. Even in English, word forms such as “cannot” need to be parsed as sequences of ALUs (i.e. “can” “not”).

Attention INTEX users: NooJ dictionaries, as opposed to INTEX’s, contain the full description of the inflection and the derivation of their entries. This explains why NooJ does not need DELAF-type dictionaries any more. More importantly, whereas INTEX’s lookup process could only lemmatize word forms (e.g. “was” → “be”), NooJ’s entries can generate any derived or inflected form of any word form (e.g. “was” → “been”).

Finally, we will see that frozen expressions (the fourth type of ALUs), as well as semi-frozen expressions, are described in NooJ’s Syntactic component: click Info > Preferences, select the English language “en”, then click the tab Syntactic Analysis. The page that is displayed contains a list of syntactic grammars. We will see later how to formalize frozen expressions.

9.2 Dictionary Format

Attention INTEX users: NooJ dictionaries are a unified version of DELAS, DELAC and DELAE dictionaries. NooJ needs neither DELAF nor DELACF dictionaries. NooJ dictionaries contain both simple words and compound words (aka multi-word units) without distinction, and can represent spelling or terminological variants of lexical entries.

Generally, the dictionary of a given language contains all of the lemmas of the language, and associates them with a morpho-syntactical code, possible syntactic and semantic codes, and inflectional and derivational paradigms. Here, for example, are some entries from the English dictionary:







a lot of,DET+p


The first line represents the fact that the word “a” is a determiner (DET), in the singular form (+s). The second line describes the word “aback” as an adverb (ADV). Note that the word “abandon” functions either as a noun (N) or as a verb (V); it is therefore duplicated: NooJ dictionaries cannot contain lexical entries with more than one syntactic category. Consequently, NooJ’s lexical parser will process the word form “abandon” as ambiguous.

Lexical ambiguity: When a word is associated with different sets of properties, i.e. different syntactic or distributional information, we must duplicate the word in the dictionary. The corresponding word form will be processed as ambiguous.

In NooJ dictionaries, all linguistic information, including syntactic codes such as “+tr” (transitive) and semantic codes such as “+Conc” and “+Abst”, must be prefixed with the character “+”.

NooJ is case sensitive; for instance the two codes “+s” (e.g. singular) and “+S” (e.g. Subjunctive) are different.

Properties may have values; in the example above, we see that the noun “abandon” is associated with the property “+FLX=APPLE”. That means that its inflectional paradigm is “APPLE” (i.e. it inflects like “apple”). In the same way, the inflectional paradigm of the verb “to abandon” is “ASK”, etc. Inflectional and derivational paradigms are described in Inflectional/Derivational description files (see later).
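The entry format described above can be illustrated with a short Python sketch. It is not NooJ's own parser: the function name parse_entry is hypothetical, and the sample entries are reconstructed from the prose of this section, so any additional codes carried by the real dictionary entries are omitted. The sketch handles only simple two-field entries (lemma, then information), without super-lemmas.

```python
from collections import defaultdict

def parse_entry(line):
    """Split a simple 'lemma,CATEGORY+code+Prop=value' line into its parts."""
    lemma, info = line.split(",", 1)
    category, *codes = info.split("+")
    features = set()      # plain codes such as 's' or 'p' (case-sensitive)
    properties = {}       # valued properties such as FLX=APPLE
    for code in codes:
        if "=" in code:
            name, value = code.split("=", 1)
            properties[name] = value
        else:
            features.add(code)
    return {"lemma": lemma, "category": category,
            "features": features, "properties": properties}

# Entries reconstructed from the description in this section (assumed).
SAMPLE = [
    "a,DET+s",
    "aback,ADV",
    "abandon,N+FLX=APPLE",
    "abandon,V+FLX=ASK",
    "a lot of,DET+p",
]

index = defaultdict(list)
for line in SAMPLE:
    entry = parse_entry(line)
    index[entry["lemma"]].append(entry)

# A lemma listed more than once is lexically ambiguous.
ambiguous = sorted(lemma for lemma, entries in index.items() if len(entries) > 1)
print(ambiguous)
```

Because codes are stored as plain strings, “+s” and “+S” remain distinct, which mirrors the case sensitivity mentioned above.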

Remember that NooJ does not know what these codes mean, and it cannot verify that the codes used in a query or in a grammar, as entered by the user, are correct. For example, nothing prohibits the user from writing the following query:


although this query will probably not identify anything, because no NooJ dictionaries use these codes (that is, until someone does!). The advantage of this “freedom of expression” is that users are free to invent and add any code in their dictionary.

Before using lexical symbols in queries or grammars, make sure that they are actually present in the dictionaries you are using!

Now consider the word form “abandoned”. In the dictionary, we see that there is an entry “abandoned” (Adjective). At the same time, we will see later that the inflectional paradigm “ASK”, which is the inflectional paradigm for the verb “abandon”, produces the conjugated form “abandoned”, with two analyses: “Preterit” or “Past Participle”. Consequently, NooJ’s lexical parser will process “abandoned” as three-way ambiguous.

Morphological ambiguity: when a word form is associated with more than one morphological analysis, it is processed as ambiguous.

Other, more specialized dictionaries are also available on the NooJ website http://www.nooj4nlp.net. If you have built such dictionaries and you think that they might be useful for other NooJ users, please send them to us (with some documentation!).

Lexical variants

NooJ lexical entries can be linked to a “super-lemma”, that acts as a canonical form for the lexical entries as well as all their inflected and derived forms:






Lexical entries linked to the same super-lemma will be considered as equivalent from NooJ’s point of view. Although each lexical entry has its own inflectional paradigm (e.g. “czar” inflects as “czars” in the plural), all the inflected word forms:

tsar, tsars, csar, csars, czar, czars, tzar, tzars

will be stored in the same equivalence class, and any of the symbols <czar>, <tzars>, etc. will match the previous eight word forms.
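The equivalence class built around a super-lemma can be sketched in Python as follows. This is only an illustration under stated assumptions: the choice of “tsar” as the super-lemma and the toy inflect function (singular plus a plural in “-s”) are hypothetical stand-ins, since the actual dictionary entries and paradigms are not reproduced here.

```python
from collections import defaultdict

# (lemma, super-lemma) pairs; the super-lemma "tsar" is an assumption.
ENTRIES = [("tsar", "tsar"), ("csar", "tsar"), ("czar", "tsar"), ("tzar", "tsar")]

def inflect(lemma):
    """Toy stand-in for an inflectional paradigm: singular and plural in -s."""
    return [lemma, lemma + "s"]

form_to_super = {}                # word form -> super-lemma
classes = defaultdict(set)        # super-lemma -> equivalence class of forms
for lemma, super_lemma in ENTRIES:
    for form in inflect(lemma):
        form_to_super[form] = super_lemma
        classes[super_lemma].add(form)

def match_symbol(symbol):
    """Return every word form matched by a symbol such as '<czar>'."""
    word = symbol.strip("<>")
    return sorted(classes[form_to_super[word]]) if word in form_to_super else []

print(match_symbol("<tzars>"))
```

Any of the eight forms can be used inside the symbol; each one resolves to the same class.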

Lexical variants can be simple or multi-word terms, e.g.:


Big Blue,ibm,N+Company

IBM Corp.,ibm,N+Company

IBM Corporation,ibm,N+Company

Armonk Headquarters,ibm,N+Company


In the latter case, the symbol <ibm> will match any of the term’s variants.

Super-lemmas, as well as lemmas, must never be spelled in uppercase. NooJ needs to distinguish between queries on syntactic categories, such as “<ADV>” and queries on lemmas (or super-lemmas), such as “<be>”.

Inversely, dictionaries can also contain entries which are in fact simple words, for example:

board,school board,N+Organization

board,bulletin board,N+Conc

board,circuit board,N+Conc # electronics

This allows us to represent ambiguities associated with certain simple words; therefore, the simple word “board” will be treated as ambiguous after consultation of the dictionary.

Finally, the super-lemma associated with each entry of the dictionary can also be used in translation applications; for example:

carte bleue,credit card,N+Conc

carte routière,road map,N+Conc


carte postale,postcard,N+Conc

Complex tokenization

Sometimes, one single token corresponds to more than one Atomic Linguistic Unit. For instance, the word form “cannot” must be analyzed as a sequence of the two ALUs: “can” (Verb) “not” (Adverb). In a NooJ dictionary, one would represent this tokenization by the following complex lexical entry:


where <can=V> and <not=ADV> are lexical constraints that are checked by looking up dictionaries. These lexical constraints allow the former token to be analysed as the verb “can” and not the noun.
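The role of the lexical constraints can be sketched in Python. The tables LEXICON and COMPLEX and the function analyze are hypothetical names introduced for illustration; they mimic the idea of checking each constraint against a dictionary lookup and do not reproduce NooJ's entry syntax or its internal data structures.

```python
# Toy lexicon: word form -> set of categories (assumed for illustration).
LEXICON = {"can": {"V", "N"}, "not": {"ADV"}}

# Complex entries: one token -> sequence of (lemma, category) constraints.
COMPLEX = {"cannot": [("can", "V"), ("not", "ADV")]}

def analyze(token):
    """Return the possible analyses of a token, each one a list of ALUs."""
    if token in COMPLEX:
        constraints = COMPLEX[token]
        # Every constraint must be confirmed by a dictionary lookup.
        if all(cat in LEXICON.get(lemma, set()) for lemma, cat in constraints):
            return [list(constraints)]
        return []
    return [[(token, cat)] for cat in sorted(LEXICON.get(token, set()))]

print(analyze("cannot"))
print(analyze("can"))
```

On its own, “can” keeps both its noun and verb readings, while inside “cannot” only the verb reading survives because the constraint names the category explicitly.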

Similarly, when a sequence of tokens corresponds to more than one ALU, associate the corresponding NooJ dictionary entry with a sequence of lexical constraints:


Finally, some of these contracted sequences might be unambiguous; in that case, one can add the +UNAMB feature (see below) to the entry:


Special Information Codes

Although NooJ allows users to use and invent any code to describe lexical entries, several codes have a special meaning for NooJ:

+NW non-word

+UNAMB unambiguous word

+FLX inflectional paradigm

+DRV derivational paradigm

The code +NW (non word) is used to describe abstract lexical entries that do not occur in texts, and should not be analyzed as real word forms. This feature is particularly useful when building a dictionary in which entries are stems or roots of words.

The code +UNAMB (unambiguous word) tells NooJ to stop analyzing the word form with other lexical resources or with the morphological parser. For instance, consider the following lexical entries:


round table,N+UNAMB+Abst


If this dictionary is used to parse the text sequence “... round table ...”, then only the lexical hypothesis “round table = N+Abst” will be used, and the other solution, i.e. “round,A table,N+Conc” will be discarded. In the same way, if the dictionary contains a lexical entry such as:


then the property +UNAMB inhibits all other analyses, such as a morphological analysis of trader = Verb “to trade” + nominalization suffix “-er”.
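The effect of +UNAMB can be pictured as a filter over competing lexical hypotheses. The following Python sketch assumes a simplified representation in which each analysis is a list of (lemma, category, features) triples; the function name select_analyses is hypothetical and the sketch says nothing about how NooJ implements this internally.

```python
def select_analyses(analyses):
    """Keep only +UNAMB analyses when at least one is present."""
    unambiguous = [a for a in analyses
                   if any("UNAMB" in feats for _, _, feats in a)]
    return unambiguous or analyses

# Two competing hypotheses for the text sequence "round table".
candidates = [
    [("round", "A", set()), ("table", "N", {"Conc"})],
    [("round table", "N", {"UNAMB", "Abst"})],
]

print(select_analyses(candidates))
```

When no hypothesis carries the code, all hypotheses are kept and the sequence stays ambiguous.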

The properties +FLX and +DRV are used to describe the inflectional and derivational paradigms of the lexical entry, see below.

9.3 Free sequences of words vs. multi-word units

An important problem faced by the project of describing natural languages on a wide scale is that of the limit between multi-word units (that must be described in dictionaries) and free word sequences (that must be described by syntactic grammars). It is evident for those who perform automatic analyses on natural language texts, that in the following examples:

a red neck, a red nail, a large neck

The first Adjective-Noun sequence must be lexicalized: if red neck is not included in a dictionary, a computer would be incapable of predicting its lexical properties (for example, “+Hum” for human noun), and certain applications, such as automatic translation, would produce incorrect results (such as “cou rouge” instead of “péquenaud” in French).

These two examples are straightforward; but between these two extremes (an idiomatic multi-word unit vs a free series of simple words), there are tens of thousands of more difficult cases, such as:

rock group, free election, control panel

I have adopted a set of criteria that establish limits, within the NooJ framework, between those sequences that should be lexicalized (i.e. multi-word units), and those that we choose not to lexicalize (cf. [Silberztein 1993]). The primary criteria are:

Semantic atomicity: if the exact meaning of a sequence of words cannot be computed from the meaning of the components, the sequence must be lexicalized; it is therefore treated as a multi-word unit.

For example, the noun group is used in a dozen or so meanings: age group, control group, rock group, splinter group, umbrella group, tour group, etc. Although the word “group” in each of these sequences always refers to a “group of people”, each meaning of “group” is different: an age group is a subset of a population characterized by their age; it is not an organization. A control group is a subset of people used for comparison with others as part of a medical experiment. A rock group is a music band. A splinter group is an organized set of people who leave a larger organization to form a new one. A tour group is a group of tourists who travel together in order to visit places of interest. An umbrella group is an organization that represents the interests of smaller local organizations.

Only the first meaning of group (i.e. “band”) is relevant in the noun phrase a rock group, which cannot mean: a set of rocks, a rocker organization, the audience at a rock concert, etc. In the same way, a tour group is not a band on a tour; a control group is neither an organization that aims at controlling something nor a music band, etc.

Similarly, the noun “election” is used in a dozen or so meanings:

presidential election, party election, runoff election, general election, free election

Here too, each of these phrases has a different semantic analysis:

presidential election = people elect someone president

party election = a party elects its leader/representative

free election = an election is free = people are free to participate in the election

and, at the same time, inhibits other possible analyses: in a presidential election, presidents do not elect their representative; in a general election, people do not vote for a general, etc.

Is it then possible that the modifier is responsible for the selection of each of the dozen meanings of “election” or “group”? In other words, maybe the simple word “control” could be linked to the concept “experimental test” in such a way that “control”, combined with “group”, would produce the meaning “test group used for experiments”? Unfortunately, even a superficial study proves this hypothesis wrong. For instance, consider the following noun phrases built with “control”:

A control cell, a control freak, a control panel, a control surface, a control tower

A “control cell” is indeed a test cell used after a biological experiment for comparisons with other cells, but a “control freak” is not a “freak” compared with others after some experiments. A “control panel” is a board used to operate a machine. A “control surface” is a part of an aircraft used to rotate it in the air. A “control tower” regulates the traffic of airplanes around an airport, etc. Thus, the semantic characteristics of the constituent “control” are different in each of these sequences.

In conclusion, words such as “control” or “group” potentially correspond to dozens of concepts. Somehow, combining these two ambiguous words produces an unpredictable semantic “accident”, that selects one specific, unambiguous concept among the hundred potential ones. In order to correctly process these combinations as unambiguous, we must describe them explicitly. In effect, this description -- naturally implemented in a NooJ dictionary -- is equivalent to processing them as multi-word units.

Terminological usage: if, to refer to a certain concept or object, people always use one specific sequence of words, the sequence must be lexicalized; it is therefore treated as a multi-word unit.

Fluent speakers often share a single specific expression to refer to each of numerous objects or concepts in our world environment. For instance, the meaning of the compound adjective “politically correct” could be as well expressed by dozens of expressions, including “socially respectful”, “derogatorily free”, “un-offensive to any audience’s feelings”, or, why not even “ATAM” for “Acceptable To All Minorities”. However, there is a shared consensus among English speakers on how to express this concept, and any person who does not use the standard expression would “sound funny”.

Numerous noun phrases are used in a similar way. Compare, for instance, the following noun phrases to the left and right:

a shopping mall   | a commercial square, a retail district, a shop place

a washing machine | a cloth washer, an automatic cleaner, a textile cleaning device

a health club     | a health saloon, a musculation club, a gym center

a word processor  | a word typing program, a software writer, a text composer

a parking lot     | a car area, a car square, a vehicle park, a resting car lot

an ice cream      | a creamed ice, a sugary ice, an icy dessert, a cool sweet

There is no morphological, syntactic, or semantic reason why the phrases to the left are systematically used, while none of those to the right is ever used. From a linguistic point of view, the noun phrases to the right are potential constructs that follow simple English syntactic rules but never occur in fact, whereas the terms to the left are elements of the vocabulary shared by English speakers, and hence must be treated as Atomic Linguistic Units (i.e. lexicalized multi-word units).

Illusion of productivity

One problem that has often misled linguists is that terms often exist in series that are apparently productive. For instance, consider the following list of terms constructed with the noun “insurance”:

Automobile insurance, blanket insurance, fire insurance, health insurance, liability insurance, life insurance, travel insurance, unemployment insurance

The fact that there is a potentially large series of insurance terms may lead some linguists to assume that these terms are productive and therefore should be neither listed nor described explicitly. But the alternative to an explicit description of these terms would be to use general syntactic rules, such as “any noun can be followed by ‘insurance’.” These rules would be useless if one wanted to distinguish the previous terms from the following sequences:

Anti-damage insurance, cover insurance, combustion insurance, sickness insurance, responsibility insurance, death insurance, tourism insurance, job loss insurance

In fact, each term from the former series can be linked to some description that would define its properties or meaning (e.g. a legal contract), whereas none of the sequences from the latter series actually has any precise meaning.

9.4 Properties Definition File

In NooJ dictionaries (just like in INTEX’s DELA-type dictionaries), users can create new information codes, such as +Medic, +Politics or +transitive and add them to any lexical entry. These codes become instantly available and can be used right away in any NooJ query or grammar, by using lexical symbols such as <N+Medic> or <N+Hum-Politics+p>, etc.
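How a lexical symbol such as <N+Hum-Politics+p> selects annotations can be sketched in Python. The sketch assumes that a code introduced by “+” must be present and that a code introduced by “-” must be absent; the function names parse_symbol and matches are hypothetical, and codes that themselves contain “+” or “-” are not handled.

```python
import re

def parse_symbol(symbol):
    """Split '<N+Hum-Politics+p>' into category, required and excluded codes."""
    body = symbol.strip("<>")
    category = re.match(r"[^+-]+", body).group(0)
    required = set(re.findall(r"\+([^+-]+)", body))
    excluded = set(re.findall(r"-([^+-]+)", body))
    return category, required, excluded

def matches(symbol, category, features):
    """True if an annotation (category + set of codes) satisfies the symbol."""
    cat, required, excluded = parse_symbol(symbol)
    return cat == category and required <= features and not (excluded & features)

print(parse_symbol("<N+Hum-Politics+p>"))
```

Since the matcher only compares strings, a newly invented code works as soon as it appears in both the dictionary entry and the symbol, which is the “instant availability” described above.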
NooJ users can also use information codes that are in the form of a property with a value, such as in “+Nb=plural” or “+Domain=Medic”. This allows NooJ to link properties and their values, as can be shown in the table view of the dictionary (DICTIONARY > View as table):

Figure 2. Displaying a dictionary as a table

In this dictionary, the FLX and the Syntax properties are associated with values, such as “+FLX=LIVE” or “+Syntax=aux”.

It is also possible to define the relationship between categories, properties and their values by using a Dictionary Properties’ Definition file. In this file, users can describe what properties are relevant for each morpho-syntactic category, and what values are expected for each property. The following figure shows a few properties’ definitions:

V_Pers = 1 + 2 + 3;

V_Nb = s + p;

V_Tense = G + INF + PP + PR + PRT;

V_Syntax = aux + i + t ;

These definitions state that the category “V” is associated with properties “Pers”, “Nb”, “Tense” and “Syntax”; these properties’ possible values are then listed. If these definitions are used, then the notation “+Syntax=aux” can be abbreviated into “+aux”.
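The link between an abbreviated code and its full property=value form can be sketched in Python using the four definitions shown above. The function names parse_definitions and expand are hypothetical, and the sketch only covers the “CATEGORY_Property = v1 + v2;” pattern that appears in this section.

```python
DEFINITIONS = """
V_Pers = 1 + 2 + 3;
V_Nb = s + p;
V_Tense = G + INF + PP + PR + PRT;
V_Syntax = aux + i + t ;
"""

def parse_definitions(text):
    """Map each category to {property: [values]} from 'CAT_Prop = v1 + v2;' rules."""
    table = {}
    for rule in text.split(";"):
        if "=" not in rule:
            continue
        left, right = rule.split("=", 1)
        category, prop = left.strip().split("_", 1)
        values = [v.strip() for v in right.split("+")]
        table.setdefault(category, {})[prop] = values
    return table

def expand(category, code, table):
    """Turn an abbreviated code such as 'aux' into 'Syntax=aux' when defined."""
    for prop, values in table.get(category, {}).items():
        if code in values:
            return f"{prop}={code}"
    return code

TABLE = parse_definitions(DEFINITIONS)
print(expand("V", "aux", TABLE))   # Syntax=aux
```

A code that is not listed for the category (for example a user-invented semantic code) is returned unchanged.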

Finally, it is possible to let NooJ know that certain property values are inflectional features. This allows NooJ’s morphological parser to analyze a word form by making it inherit its inflectional (or non-inflectional) properties from another word form. For instance, we can tell NooJ’s morphological parser to analyze the word form “disrespects” by copying all the inflectional features of the word form “respects” (i.e. +PR+3+s). To let NooJ know what the inflectional features are, we enter a single rule in the dictionary properties’ definition file:

INFLECTION = 1 + 2 + 3 # person

+ m + f + n # gender

+ s + p # number

+ G + INF + PP + PR + PRT # tense
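The inheritance of inflectional features can be sketched in Python. The set INFLECTION copies the values of the rule above; the function name inherit and the code “tr” attached to “respects” are hypothetical, used only to show that non-inflectional codes are left out when inflectional features are copied.

```python
# Values declared as inflectional features by the INFLECTION rule.
INFLECTION = {"1", "2", "3", "m", "f", "n", "s", "p", "G", "INF", "PP", "PR", "PRT"}

def inherit(prefix, base_form, base_analysis):
    """Analyze prefix+base_form by copying the base form's inflectional codes."""
    category, codes = base_analysis
    inflectional = [c for c in codes if c in INFLECTION]
    return prefix + base_form, category, inflectional

# "respects" analyzed as a verb with codes tr (assumed), PR, 3, s.
print(inherit("dis", "respects", ("V", ["tr", "PR", "3", "s"])))
```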


Chapter 10. Inflection And Derivation

In order to analyze texts, NooJ needs dictionaries which house and describe all of the words of that text, as well as some mechanism to link these lexical entries to all the corresponding (inflected and/or derived) forms that actually occur in texts.

Inflected forms: in most languages, words are inflected as they are conjugated, used in the plural, in the accusative, etc. For instance, in English, verbs are conjugated, nouns are inflected in the plural. In French: verbs are conjugated, nouns and adjectives are inflected in the feminine and in the plural.

Derived forms: in most languages, it is possible to use a word, with a combination of affixes (prefixes or suffixes) to construct another word that is not necessarily of the same syntactic category. For instance, from the verb “to mount”, we can produce the verb “to dismount”, as well as the adjectives “mountable” and “dismountable”; from the adjective “American”, we produce the verb “to americanize”, etc.

Lexical entries in NooJ dictionaries can be associated with a paradigm that formalizes their inflection, i.e. verbs can be associated with a conjugation class, nouns can be associated with a class that formalizes how to write them in the plural, etc.

The aim of describing the inflection of a lexical entry is to be able to automatically link all its word forms together, so that, for instance, the lexical symbol:

<be>

matches any of the forms in the set:

be, am, is, are, was, were, been, being

Note that lexical symbols do not define equivalence sets, because of potential ambiguities. For instance, the lexical symbol <being> matches all the previous word forms, as well as the word form “beings” (because being is ambiguous, it belongs to two different sets of word forms).
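The reason why lexical symbols do not define equivalence sets can be made concrete with a Python sketch. The table PARADIGMS is a toy stand-in for the dictionary and its paradigms, and match_symbol is a hypothetical function: a symbol matches the forms of every lemma that has the given word among its forms.

```python
# Lemma (with category) -> inflected forms; a toy table assumed for illustration.
PARADIGMS = {
    ("be", "V"): {"be", "am", "is", "are", "was", "were", "been", "being"},
    ("being", "N"): {"being", "beings"},
}

def match_symbol(symbol):
    """Forms matched by '<word>': all forms of every lemma that has this form."""
    word = symbol.strip("<>")
    matched = set()
    for forms in PARADIGMS.values():
        if word in forms:
            matched |= forms
    return matched

print(sorted(match_symbol("<being>")))
```

Here “being” is matched by the symbol built on “be”, yet the symbol built on “being” matches a larger set, so the relation is not an equivalence.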

10.1 Inflectional Morphology

NooJ’s inflection module is triggered by adding the special property “+FLX” to a lexical entry. For instance, in the _Sample.dic dictionary (file stored in “Nooj\en\Lexical Analysis”), we see the following entries:






This sample of a dictionary states that lexical entries artist, cousin, pen and table share the same inflectional paradigm, named “TABLE”, while the lexical entry man is associated with the paradigm “MAN”.

NooJ provides two equivalent tools to describe these inflectional paradigms: either graphically or by means of (textual) rules.

Both descriptions are equivalent, and internally compiled into a Finite-State Transducer (in “.nof” files).

Describing inflection graphically

The dictionary must contain at least one line that starts with a command such as:

#use NameOfAnInflectionalGrammar.nof

Paradigm names correspond to graphs included in the inflectional/derivational grammar. For example, for the following lexical entry:


there must exist, in the inflectional grammar of the dictionary, a graph called “TABLE” which describes the two forms cousin and cousins. All of the nouns inflected in this manner (e.g. artist, pen, table, etc.) are associated with the same graph.

Inflectional paradigms can be described by transducers whose inputs are the suffixes that must be added to the lexical entry (i.e. lemma) in order to obtain each inflected form, and whose outputs are the corresponding inflectional codes (“s” for singular, and “p” for plural). For example, here is the graph TABLE associated with nouns such as cousin:

Figure 3. The graph for paradigm TABLE

This graph is used in the following manner:

-- (upper path) if we add the empty suffix to the lexical entry, we produce the form “cousin” associated with the inflectional code “s” (singular);

-- (lower path) if we add the suffix “s” to the lexical entry, we produce the form “cousins” associated with the inflectional code “p” (plural).
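The two paths of the graph can be written down as data. In the following Python sketch, a paradigm is a list of (suffix, code) pairs, one per path; this representation and the function name inflect are illustrative assumptions and not the format of “.nof” files.

```python
# A paradigm as a list of (suffix, code) paths; "" stands for the empty suffix.
TABLE = [("", "s"), ("s", "p")]

def inflect(lemma, paradigm):
    """Follow every path: append the suffix, output the inflectional code."""
    return [(lemma + suffix, code) for suffix, code in paradigm]

for noun in ("cousin", "artist", "pen", "table"):
    print(inflect(noun, TABLE))
```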

Special operators

In English, numerous lexical entries are not merely prefixes of all their inflected forms; for example, the entry “man” is not a prefix of the form “men”. In order to obtain that form from the lemma, we must delete the last two letters “n” and “a” of the lemma, and then add the suffix “en”.

In NooJ, to delete the last character, we use the operator <B> (Backspace) and its variant <B2> (to perform the operation twice):

Figure 4. Inflectional paradigm MAN

Inflecting the lemma into its plural form is performed in three steps:

man.<B2>en → m.en → me.n → men

where the dot represents the top of a stack. Note that each operation (either delete a number of characters, or add a new one) takes a constant time to run. Therefore each inflected form can be computed in a time proportional to its length (O(n)).

Thanks to the <B> operator, we can represent all possible types of inflection, including those considered more “exotic”; for example:

recordman → recordman<B3>woman → recordwoman
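The stack behaviour described above can be sketched in Python. The function apply_suffix is hypothetical and supports only the empty string <E>, the backspace operator <B> (optionally followed by a repeat count) and literal letters; any other operator is rejected.

```python
import re

def apply_suffix(lemma, command):
    """Apply a command made of <E>, <B>, <Bn> and literal letters to a lemma."""
    if not re.fullmatch(r"(?:<[EB]\d*>|[^<>]+)*", command):
        raise ValueError("only <E>, <B>, <Bn> and letters are handled here")
    stack = list(lemma)                      # the end of the list is the top
    for op, count, text in re.findall(r"<([EB])(\d*)>|([^<>]+)", command):
        if op == "B":
            for _ in range(int(count or 1)):
                stack.pop()                  # delete the last character
        elif text:
            stack.extend(text)               # push the literal letters
        # <E> is the empty string: nothing to do
    return "".join(stack)

print(apply_suffix("man", "<B2>en"))
print(apply_suffix("recordman", "<B3>woman"))
```

Each pop or push is a constant-time operation on the list, in line with the cost argument given above.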

However, NooJ includes 10 other default operators:

<E> empty string

<B> delete last character

<D> duplicate last character

<L> go left

<R> go right

<N> go to the end of next word form

<P> go to the end of previous word form

<S> delete next character

For instance, in order to describe the plural form of the compound noun bag of tricks, we could perform the following operation:

bag of tricks → <P2>s → bags of tricks

(go to the end of the previous word form twice, then add an “s”).
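The operators listed above can be pictured as commands that move a cursor over the entry or edit the text around it. The Python sketch below is one possible reading of their descriptions, not NooJ's implementation: it assumes that the cursor starts at the end of the entry, that word forms are separated by single spaces, and that a number after the operator letter repeats the operation. The function name run is hypothetical.

```python
import re

def run(text, command):
    """Apply operators and literal letters to text; the cursor starts at the end."""
    pos = len(text)
    for op, count, letters in re.findall(r"<([EBDLRNPS])(\d*)>|([^<>]+)", command):
        if letters:                                   # insert literal letters
            text = text[:pos] + letters + text[pos:]
            pos += len(letters)
            continue
        for _ in range(int(count or 1)):
            if op == "B" and pos > 0:                 # delete character before cursor
                text, pos = text[:pos - 1] + text[pos:], pos - 1
            elif op == "S":                           # delete character after cursor
                text = text[:pos] + text[pos + 1:]
            elif op == "D" and pos > 0:               # duplicate character before cursor
                text, pos = text[:pos] + text[pos - 1] + text[pos:], pos + 1
            elif op == "L":                           # go left
                pos = max(pos - 1, 0)
            elif op == "R":                           # go right
                pos = min(pos + 1, len(text))
            elif op == "P":                           # end of previous word form
                pos = max(text.rfind(" ", 0, pos), 0)
            elif op == "N":                           # end of next word form
                nxt = text.find(" ", pos + 1)
                pos = len(text) if nxt == -1 else nxt
    return text

print(run("bag of tricks", "<P2>s"))
```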

Users can modify the behaviour of these operators, and add their own. For instance, the operators <A> (remove the accent from the current letter), <Á> (add an acute accent) and <À> (add a grave accent) were added for most Romance languages, the operator <F> (“finalize” the current letter) was added for Hebrew, and the behaviour of the <B> command in Hebrew takes into account silent “e” when performed on a final consonant.

Use of Embedded graphs

In an inflectional grammar, it is possible to use graphs that will be embedded in one or more inflectional paradigms. This feature allows linguists to share a number of graphs that correspond to identical sets of suffixes, and also to generalize certain morphological properties. For instance, the following French graph represents the inflectional paradigm of the noun “cousin”:

Figure 5. Inflectional paradigm with embedded graphs

The embedded graph “Genre” takes care of the gender suffix (add an “e” for feminine) and the graph “Nombre” takes care of the number suffix (add an “s” for plural):

Figure 6. The two embedded graphs

In the same manner, one could design a French graph “Future” that would represent the set of suffixes: ai, as, a, ons, ez, ont, and simply add this graph to a number of verb conjugation paradigms.

Describing Inflection textually

The dictionary must contain at least one line that starts with a command such as:

#use NameOfAnInflectionalDescriptionFile.nof

Paradigm names correspond to the rules included in the inflectional description file. For example, for the following lexical entry:


there must exist, in the inflectional description file of the dictionary, a rule called “ASK” which describes all the conjugated forms of the verb to help. All of the verbs that conjugate in this manner (e.g. love, ask, count, etc.) are associated with the same rule. Below is the rule “ASK”:

ASK = <E>/INF + <E>/PR+1+2+s + <E>/PR+1+2+3+p + s/PR+3+s + ed/PP + ed/PRT + ing/G ;

This paradigm states that if we add an empty string to the lexical entry (e.g. help), we get the infinitive form of the verb (to help), the Present, first or second person singular (I help), or any of the plural forms (we help). If we add an “s” to the entry, we get the Present, third person singular (he helps). If we add “ed”, we get the past participle form (helped) or any of the preterit forms (we helped). If we add “ing”, we get the gerundive form (helping).

Never forget to end each rule definition with a semi-colon character “;”.
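As a rough illustration of how such a rule can be read, the following Python sketch parses the rule “ASK” and applies it to the entry “help”. It is not NooJ's implementation: the function names are hypothetical, only plain suffixes and the empty string <E> are handled, and the sketch assumes that alternatives are separated by a “+” surrounded by spaces, as in the rule above.

```python
import re

ASK_RULE = "ASK = <E>/INF + <E>/PR+1+2+s + <E>/PR+1+2+3+p + s/PR+3+s + ed/PP + ed/PRT + ing/G ;"

def parse_rule(text):
    """Split 'NAME = suffix/features + ... ;' into (name, [(suffix, features), ...])."""
    name, body = text.split("=", 1)
    body = body.strip().rstrip(";").strip()
    alternatives = []
    # Alternatives are separated by a "+" surrounded by spaces; the "+" signs
    # inside a feature list (e.g. PR+3+s) have no spaces around them.
    for alternative in re.split(r"\s+\+\s+", body):
        suffix, features = alternative.split("/", 1)
        alternatives.append((suffix.strip(), features.strip()))
    return name.strip(), alternatives

def inflect(entry, alternatives):
    """Append each suffix to the entry; <E> stands for the empty string."""
    return [(entry + ("" if suffix == "<E>" else suffix), features)
            for suffix, features in alternatives]

name, alternatives = parse_rule(ASK_RULE)
for form, features in inflect("help", alternatives):
    print(form, features)
```

The sketch lists seven (form, features) pairs for “help”, one per alternative of the rule.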

Use of Embedded rules

Just as it is possible to embed graphs in an inflectional/derivational grammar, it is also possible to add auxiliary rules that do not correspond directly to paradigm names, but can be used by other rules. For instance, in the following French inflectional description file:

Genre = <E>/m + e/f;

Nombre = <E>/s + s/p;

Crayon = <E>/m :Nombre;

Table = <E>/f :Nombre;

Cousin = :Genre :Nombre;

The two rules Genre and Nombre are auxiliary rules that are shared by the three paradigms Crayon, Table and Cousin.
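The way auxiliary rules combine can be sketched in Python. This is only an illustration under simplifying assumptions (hypothetical function names, no cursor operators, a reference such as “:Nombre” is expanded by concatenating suffixes and features); it is not NooJ's compiler.

```python
import re

RULES_TEXT = """
Genre = <E>/m + e/f;
Nombre = <E>/s + s/p;
Crayon = <E>/m :Nombre;
Table = <E>/f :Nombre;
Cousin = :Genre :Nombre;
"""

def parse_rules(text):
    """Map each rule name to a list of alternatives; each alternative is a list of items."""
    rules = {}
    for definition in text.split(";"):
        if "=" not in definition:
            continue
        name, body = definition.split("=", 1)
        rules[name.strip()] = [alt.split() for alt in re.split(r"\s+\+\s+", body.strip())]
    return rules

def expand(rules, name):
    """Return the (suffix, [features]) pairs described by a rule, following ':Rule' references."""
    results = []
    for alternative in rules[name]:
        partial = [("", [])]
        for item in alternative:
            if item.startswith(":"):          # reference to an auxiliary rule
                choices = expand(rules, item[1:])
            else:                             # plain "suffix/feature" item
                suffix, feature = item.split("/", 1)
                choices = [("" if suffix == "<E>" else suffix, [feature])]
            partial = [(s1 + s2, f1 + f2) for s1, f1 in partial for s2, f2 in choices]
        results.extend(partial)
    return results

rules = parse_rules(RULES_TEXT)
for suffix, features in expand(rules, "Cousin"):
    print("cousin" + suffix, "+".join(features))
```

Expanding “Cousin” yields the four forms cousin (m+s), cousins (m+p), cousine (f+s) and cousines (f+p), while “Crayon” and “Table” only go through “Nombre”.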

10.2 Inflecting Multiword units

Multiword units whose inflection operates on the last component are inflected in exactly the same manner as simple words. For instance, to inflect the term “jet plane”, one can use the very same rule as the one used to inflect the simple noun “plane”:

PEN = <E>/singular + s/plural;

However, when other components of a multiword unit inflect, such as in “man of honor”, NooJ provides the two operators <N> (go to the end of the next word form) and <P> (go to the end of the previous word form) in order to inflect a selected component:

MANOFHONOR = <E>/singular + <PW><B2>en/plural;

The operator <PW> moves the cursor to the end of the first component of the multiword expression, i.e. “man”. The operator <B2> deletes the last two letters of the word form, then the suffix en is added to the word form; the resulting plural form is “men of honor”.

Note that one could have used the <P2> operator (go to the previous word form, twice) to do exactly the same thing. However, the operator <PW> is more general, and our paradigm works the same regardless of the length of the multiword unit, e.g. “man of the year”, “man of constant sorrow”, etc.
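The effect of these cursor operators can be imitated with a small Python sketch. It is a simplified model, not NooJ's engine: the function name is hypothetical, the cursor is assumed to always sit at the end of a word form, and only <E>, <B>, <P>, <N> and <PW> (each with an optional repeat count) are supported.

```python
import re

# An operator such as <B2> or <PW>, or a run of literal letters to insert.
TOKEN = re.compile(r"<([A-Z]+)(\d*)>|([^<>]+)")

def apply_command(entry, command):
    """Apply an inflectional command to a (possibly multiword) entry."""
    words = entry.split(" ")
    cursor = len(words) - 1                # start at the end of the last word form
    for name, count, literal in TOKEN.findall(command):
        n = int(count) if count else 1
        if literal:                        # insert letters at the cursor
            words[cursor] += literal
        elif name == "E":                  # empty string: nothing to do
            pass
        elif name == "B":                  # delete the last n letters
            words[cursor] = words[cursor][:max(0, len(words[cursor]) - n)]
        elif name == "P":                  # end of the previous word form, n times
            cursor = max(0, cursor - n)
        elif name == "N":                  # end of the next word form, n times
            cursor = min(len(words) - 1, cursor + n)
        elif name == "PW":                 # end of the first word form
            cursor = 0
        else:
            raise ValueError(f"unsupported operator <{name}>")
    return " ".join(words)

print(apply_command("man of honor", "<PW><B2>en"))
print(apply_command("man of the year", "<PW><B2>en"))
```

In this model <P2> and <PW> give the same result for “man of honor”, but only <PW> still reaches “man” in “man of the year”, which is the point made above.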

Another, less direct but more powerful method is to reuse the paradigms of simple words in order to describe the paradigms of multiword units. For instance, the inflection of “man of honor” could be described by the two rules:

MAN = <E>/singular + <B2>en/plural;


The first paradigm is used to inflect the simple noun “man”; the second paradigm reuses the first one to inflect all the multiword units that start with “man”.

When inflecting multiword units in which more than one component needs to be inflected at the same time, there are two cases:

(1) there is no agreement between the inflected components

For instance, in the multiword unit “man of action”, one can find the four variants: “man of action”, “man of actions”, “men of action”, “men of actions”. In that case, the paradigm used to describe the inflection of the multiword unit is very simple: just insert the paradigms used to inflect each of the inflected components:


(2) the inflected components agree in gender, or in number, or in case, etc.

For instance, in the French multiword unit “cousin germain”, the noun “cousin” and the adjective “germain” agree both in gender and in number. In that case, we describe the inflection of the multiword unit exactly as if the components did not agree, e.g.:

COUSIN = <E>/mas+sin + s/mas+plur + e/fem+sin + es/fem+plur ;


However, we tell NooJ to check the agreement between the components by checking the option “Check Agreement” in the Dictionary Compilation window:

Figure 7. Agreement in multiword units
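The difference between the two cases can be sketched in Python: every component is inflected independently, and an optional filter keeps only the combinations whose components carry identical features. The sketch rests on assumptions: the function name is hypothetical, the adjective “germain” is assumed to follow the same four-suffix pattern as COUSIN, and agreement is modelled as plain equality of feature strings, which may be simpler than what NooJ actually checks.

```python
from itertools import product

# Same shape as the COUSIN rule: (suffix, features)
COUSIN = [("", "mas+sin"), ("s", "mas+plur"), ("e", "fem+sin"), ("es", "fem+plur")]
GERMAIN = COUSIN   # assumption: the adjective takes the same suffixes

def inflect_multiword(components, paradigms, check_agreement):
    """Combine the inflected forms of each component, optionally enforcing agreement."""
    inflected = [[(word + suffix, features) for suffix, features in paradigm]
                 for word, paradigm in zip(components, paradigms)]
    results = []
    for combination in product(*inflected):
        distinct_features = {features for _, features in combination}
        if check_agreement and len(distinct_features) > 1:
            continue                       # components disagree: reject
        results.append(" ".join(form for form, _ in combination))
    return results

print(len(inflect_multiword(["cousin", "germain"], [COUSIN, GERMAIN], False)))
print(inflect_multiword(["cousin", "germain"], [COUSIN, GERMAIN], True))
```

Without the agreement check the sketch produces all sixteen combinations; with it, only the four agreeing forms remain.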

10.3 Derivational Morphology

Inflectional grammars (both graphical and textual) can include derivational paradigms. Derivational paradigms are very similar to inflectional paradigms, except that the morpho-syntactic category of each word form must be explicitly produced by derivational transducers. The special information “+DRV” is used to indicate a derivational paradigm.

For instance, here is a derivational description:

ER = er/N;

This rule states that by adding the suffix “er” to the lexical entry, one gets a noun. Now consider the following lexical entry:


This entry makes NooJ produce the noun “laugher”, which is then inflected according to the paradigm “TABLE”. The two forms “laugher” and “laughers” will then be lemmatized as “laugh”, even though they will be associated with the category “N”. In consequence, query symbols such as <laugh> will match both the conjugated forms of the verb “to laugh”, as well as its derived forms, including the noun “a laugher”, adjectives such as “laughable”, etc.

Default inflectional paradigm

Often, forms that are derived from an initial lemma inflect themselves exactly like the initial lemma. For instance, the verb “to mount” can be prefixed as “dismount”; the latter verb inflects just like “to mount”. In the same manner, the prefix “re” can be used to produce the verb “remount”, that inflects just like “mount”.

In these cases, it is not necessary to specify the inflectional paradigm to be applied to the derived form: just omit it. For instance:


states that the verb “to mount” inflects according to the inflectional paradigm “ASK”, and then derives according to the derivational paradigm “DIS”. The latter paradigm produces the verb “dismount”, which then inflects according to the default inflectional paradigm “ASK”. If the derivational paradigm “DIS” is defined as:

DIS = <LW>dis/V;

the previous entry will allow NooJ to recognize any conjugated form of the verbs “mount” and “dismount”.
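The combination of a derivational paradigm with the default inflectional paradigm can be sketched as follows. This is an illustration only: the function names are hypothetical, the derivational rule DIS is modelled as a simple prefix (assuming that <LW> moves the cursor to the beginning of the word), and the ASK paradigm is written out as a Python list.

```python
# The ASK paradigm as (suffix, features) pairs.
ASK = [("", "INF"), ("", "PR+1+2+s"), ("", "PR+1+2+3+p"), ("s", "PR+3+s"),
       ("ed", "PP"), ("ed", "PRT"), ("ing", "G")]

def inflect(base, paradigm):
    return [(base + suffix, features) for suffix, features in paradigm]

def build_lookup(lemma, category, paradigm, prefixes):
    """Map every inflected form of the lemma and of its prefixed derivations
    to the set of analyses (lemma, category, features)."""
    lookup = {}
    bases = [lemma] + [prefix + lemma for prefix in prefixes]
    for base in bases:
        # Derived forms reuse the paradigm of the initial lemma (the default paradigm)
        # and keep the initial lemma as their lemma.
        for form, features in inflect(base, paradigm):
            lookup.setdefault(form, set()).add((lemma, category, features))
    return lookup

lookup = build_lookup("mount", "V", ASK, ["dis"])
print(sorted(lookup))
print(lookup["dismounts"])
```

Every form of “dismount” is analysed with the lemma “mount”, which is why a query on the lemma also retrieves the derived verb.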

In the same manner, consider the following lexical entry:


This entry allows NooJ to recognize any conjugated form of the verb “to laugh”, plus the adjective form “laughable”, plus the two noun forms “laugher” and “laughers”.

Inflectional / derivational graphical and textual grammars (.nof files) are equivalent: NooJ compiles both into the same Finite-State Transducers, which are interpreted by the same morphological engine. Graphs are typically used for small paradigms, such as the English conjugation system (where we have only a few word forms per class), whereas textual rule-based descriptions are used for heavier inflectional systems, such as Verb conjugations in Romance Languages (30+ forms per verb).

10.4 Compile a dictionary

To make NooJ apply a dictionary during its linguistic analysis of texts and corpora, you need to compile it.

(1) Make sure that the inflectional / derivational grammars (“.nof” files), as well as the dictionary’s properties definition files (“.def” files) are stored in the same folder as the dictionary (usually, in the Lexical Analysis folder of the current language).

(2) Make sure that you have added the commands to use the inflectional / derivational paradigms’ definitions as well as the properties definitions before they are needed (preferably at the top of the dictionary), e.g.:

#use properties.def

#use nouns-inflection.nof

#use verbs-inflection.nof

#use nominalization.nof

(3) When the dictionary is ready, compile it by clicking Lab > Dictionary > Compile. This computes a minimal deterministic transducer that contains all the lexical entries of the original dictionary, plus all the corresponding inflectional and derivational paradigms, represented in such a way that all the word forms associated with the lexical entries are both recognized and associated with the corresponding properties.

This transducer is deterministic because the prefixes of all lexical entries are factorized. For example, if several thousands of entries begin with an “a”, the transducer contains only one transition labeled with the letter “a”.

This transducer is minimal because the suffixes of all word forms are also factorized. For example, if tens of thousands of word forms end with “ization”, and are associated with information “Abstract Noun”, this suffix, along with the corresponding information is only written once in the transducer.

Attention INTEX users: the compiled version of a NooJ dictionary is NOT equivalent to the transducer of a DELAF/DELACF dictionary. NooJ’s compiled dictionaries contain all (inflectional and derivational) paradigms associated with their entries; this allows NooJ’s parsers to be able to perform morphological operations, such as Nominalization, or Passivation. In essence, this new feature makes it possible to implement Transformational Analysis of texts.

(4) Check the compiled dictionary (a “.nod” file) in Info > Preferences > Lexical Analysis, so that the next time “Linguistic Analysis” is performed, the dictionary will be used. You might also want to give it a high or low priority.
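The two kinds of factorization mentioned in step (3) can be illustrated with a small Python sketch: a trie shares common prefixes, and merging states that accept exactly the same endings shares common suffixes. This is a generic textbook construction on a plain word list (no output information attached), with hypothetical function names; it is not the algorithm NooJ itself uses.

```python
def build_trie(words):
    """Prefix sharing: one state per distinct prefix."""
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for letter in word:
            node = node["next"].setdefault(letter, {"final": False, "next": {}})
        node["final"] = True
    return root

def count_trie_states(node):
    return 1 + sum(count_trie_states(child) for child in node["next"].values())

def count_minimal_states(root):
    """Suffix sharing: states with the same finality and the same outgoing
    transitions (to already merged states) are counted only once."""
    registry = {}
    def canonical(node):
        signature = (node["final"],
                     tuple(sorted((letter, canonical(child))
                                  for letter, child in node["next"].items())))
        return registry.setdefault(signature, len(registry))
    canonical(root)
    return len(registry)

trie = build_trie(["tsar", "tsars", "czar", "czars"])
print(count_trie_states(trie), count_minimal_states(trie))
```

For these four word forms the trie needs 11 states, whereas the minimized automaton needs only 7, because the shared ending “ar(s)” is stored once.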

The menus DICTIONARY and Lab > Dictionary propose a few useful tools:

-- sort your dictionary (DICTIONARY > Sort). NooJ’s sorting program respects the current language’s standard alphabetical order. It takes comments and empty lines into account; that allows one to create zones in dictionaries, such as a zone for Nouns, followed by a zone for Adjectives, etc. NooJ will sort each zone independently of the others;

-- check the format of its lexical entries: DICTIONARY > Check;

-- check that all the properties and corresponding values that you have used in the dictionary are consistent with the Properties’ definition file (you just need to display the dictionary as a table, and then check that all columns are correctly filled);

-- check the inflection and derivation implemented by your dictionary, and display the resulting list of word forms, associated with the corresponding properties. To do so, click Lab > Dictionary > Inflect.


Word forms are usually analyzed by a lookup of a dictionary associated with some optional inflectional or derivational description, as seen in the previous chapters. But there are cases where it is more natural to use productive morphological rules, rather than dictionaries, to represent sets of word forms. In NooJ, morphological rules are implemented in the form of “morphological grammars”, i.e. grammars whose input recognizes the word forms, and whose output computes the corresponding linguistic information. Morphological grammars, just like dictionaries, are stored in the “Lexical Analysis” folder of the current language, and the results of the morphological parsing are stored, exactly as the results of the dictionaries’ lookup, in the Text Annotation Structure.

A morphological grammar (i.e. “rule”) can be as simple as a graph that recognizes a fixed set of word forms, and associates them with some linguistic information. More complex grammars can recursively split word forms into smaller parts, the “affixes”, and then enforce simple or more complex morpho-syntactic constraints on each affix.

11.1 Lexical Grammars

Lexical grammars -- i.e. with no constraints -- are simple transducers or RTNs that associate recognized word forms with linguistic information. Generally, one uses these grammars to link together families of variants, or when the number of word forms would be too large to be listed in a dictionary, whereas they can easily be represented by a few productive grammars.

A simple lexical grammar

The following lexical grammar can be found in “My documents\Nooj\en\Lexical Analysis”. It is an example of an elementary grammar that recognizes a family of spelling variants:

Figure 8. Morphological Grammar “tsar”

This graph recognizes sixteen variants of the word form tsar, and associates them with the lemma tsar and the linguistic features “N+Hum” (Noun, Human), as well as their gender (+m or +f) and number (+s or +p).

To enter the output attached to a node, follow the label of the node with a slash character “/”, and then the lexical information. There are comments displayed in green. To enter a comment, create a node that will not be connected to any other node. The arrows are nodes labeled with the empty string (<E>).

Some of the information is being computed “on the fly”, i.e. along the path, during the recognition process. For instance, the suffix “ina” is locally associated with the linguistic feature +f.

This grammar is equivalent to the following dictionary:

















This grammar, and similar ones, can be used by software applications to associate “variants” of a term, whether orthographic, phonetic, synonymous, semantic, translation, etc., with one particular canonical form that acts as an index key, a hypernym, a canonical representative, or a “super lemma”. For instance, using the previous grammar, indexing a corpus would produce one single index key for the sixteen word forms, which is much better than regular indexers that typically build several unrelated index keys for csar, czar, tsar and tzar. Moreover, NooJ’s lexical symbol <tsar> would now match these sixteen word forms in texts.
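The indexing benefit can be sketched in Python. The sixteen forms are reconstructed here from the description of the graph (four spellings of the stem, the feminine suffix “ina”, the plural “s”); this reconstruction, like the function names, is an assumption, since the figure and the equivalent dictionary are not reproduced in this text.

```python
from itertools import product

ONSETS = ["cs", "cz", "ts", "tz"]
ENDINGS = [("", "+m+s"), ("s", "+m+p"), ("ina", "+f+s"), ("inas", "+f+p")]

def tsar_lexicon():
    """Map each variant to (super lemma, linguistic information)."""
    return {onset + "ar" + ending: ("tsar", "N+Hum" + features)
            for onset, (ending, features) in product(ONSETS, ENDINGS)}

def index_keys(tokens, lexicon):
    """Index token positions under the lemma when the token is a known variant."""
    index = {}
    for position, token in enumerate(tokens):
        lemma, _ = lexicon.get(token.lower(), (token.lower(), ""))
        index.setdefault(lemma, []).append(position)
    return index

lexicon = tsar_lexicon()
print(len(lexicon))
print(index_keys("The czar met the tzarinas".split(), lexicon))
```

Both “czar” and “tzarinas” end up under the single key “tsar”, instead of two unrelated keys.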

Sets of synonymous terms (such as WordNet’s Synsets) can be represented via this feature. NooJ then becomes a search engine capable of retrieving synonymous words, e.g. the query <germ> matches all its “family’s members” such as bacteria, disease, sickness, etc. as well as all its translations, if WordNet dictionaries for different languages are aligned.

Roman numerals

Here is a simple example of a morphological grammar that recognizes roman numerals:


It is out of the question to create a dictionary containing all of the roman numerals (here we arbitrarily stop at 3,999). Rather, we create a morphological grammar that contains four graphs to represent the units, the tens, the hundreds and the thousands, as well as a main graph to bring all the graphs together.

The following morphological grammar can be found in “My documents\Nooj\en\Lexical Analysis”.

Figure 9. Roman numerals, main graph

This grammar recognizes and analyses Roman numerals from I (1) to MMMCMXCIX (3,999), such as “CXXXII”, and then tags them. The resulting tags look like:


i.e. exactly as if this line had been explicitly entered in a dictionary. In order to do that, the grammar produces the output “A+RN=” followed by the numeric value of the roman numeral.

For example, in the main graph displayed above, the initial node is labeled as:


and the three <E> nodes (displayed as arrows) at the bottom of the graph, that are used to skip the “Hundreds”, the “Tens” and the “Units”, are naturally labeled with:


This morphological grammar is more complex than the previous “tsar” one, because it contains references to embedded graphs (displayed as auxiliary nodes in yellow).

NooJ grammars are organized sets of graphs. Each graph can include auxiliary nodes, that are references to other graphs. This recursive mechanism makes NooJ grammars equivalent to Recursive Transition Networks.

To enter an auxiliary node (displayed in yellow), prefix the label of the node with a colon character “:”, and then enter the name of the embedded graph.

For example, in the main graph displayed above, the node Units is labeled as “:Units”. Display it (Alt-Click an auxiliary node to explore its content):

Figure 10. Roman numerals, the units

Notice that each path implements a local translation of a roman numeral (e.g. “VII”) with its corresponding Arabic number (“7”).

You can navigate in a grammar’s structure, either by pressing the following keys:
-- “U” to display the current graph’s parent,
-- “D” to display the current graph’s first child,
-- “N” to display the next child,
-- “P” to display the previous child,
-- “Alt-Click” an auxiliary node to display it.
or by displaying the grammar’s structure in a window ( GRAMMAR > Show Structure).

The Tens graph is shown below:

Figure 11. Roman numerals, the tens

In the same way, the graph that represents the “hundreds” is similar to the previous ones: just replace the X’s with C’s, the L’s with D’s and the C’s with M’s. Finally, the graph that represents the “Thousands” is displayed below:

Figure 12. Roman numerals: the thousands
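The organization of this grammar (four sub-graphs combined by a main graph, each path translating a piece of the numeral into its value) can be mirrored in a short Python sketch. The tables and the function name are illustrative assumptions, not the content of the actual graphs; the output follows the “A+RN=” convention described above.

```python
# One table per sub-graph; the index of each string is the digit it stands for.
THOUSANDS = ["", "M", "MM", "MMM"]
HUNDREDS = ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"]
TENS = ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"]
UNITS = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]

def analyse_roman(word):
    """Return 'A+RN=<value>' if the word is a Roman numeral from I to MMMCMXCIX."""
    for t, thousands in enumerate(THOUSANDS):
        for h, hundreds in enumerate(HUNDREDS):
            for x, tens in enumerate(TENS):
                for u, units in enumerate(UNITS):
                    value = 1000 * t + 100 * h + 10 * x + u
                    # value == 0 corresponds to the empty string, which is not a numeral
                    if value and thousands + hundreds + tens + units == word:
                        return f"A+RN={value}"
    return None

print(analyse_roman("CXXXII"))
print(analyse_roman("IIII"))
```

The empty strings in each table play the role of the <E> arrows that skip the hundreds, the tens or the units.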

You can check this grammar by entering a few valid Roman numerals in the grammar’s contract (GRAMMAR > Show Contract):

Figure 13. A grammar’s contract

Make sure that valid roman numerals are indeed recognized by clicking the Check button. On the other hand, you can enter counter-examples, i.e. word forms that are not roman numerals, e.g. “iiii”; in that case, prefix them with a star character “*” to tell NooJ that they must NOT be recognized.

The grammar’s contract is a series of examples (i.e. words or expressions that MUST be recognized by the grammar), as well as counter-examples (i.e. words or expressions that MUST NOT be recognized by the grammar). When saving a grammar, NooJ always checks that its contract is honored.

When developing complex grammars, i.e. grammars with embedded graphs, that you intend to use for a long time, or that you will share with others, always use contracts: contracts guarantee that you will never break a grammar without noticing!
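The idea of a contract is close to a small regression test. The Python sketch below checks a list of examples and starred counter-examples against a recognizer; the regular expression stands in for the Roman numerals grammar, and all names are hypothetical.

```python
import re

# Stand-in recognizer for Roman numerals from I to MMMCMXCIX.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

def recognizes(word):
    return bool(word) and ROMAN.fullmatch(word) is not None

def check_contract(recognizer, contract):
    """Return the contract lines that are violated.
    A line starting with '*' is a counter-example that must NOT be recognized."""
    failures = []
    for line in contract:
        must_fail = line.startswith("*")
        example = line[1:] if must_fail else line
        if recognizer(example) == must_fail:
            failures.append(line)
    return failures

print(check_contract(recognizes, ["CXXXII", "MMMCMXCIX", "*iiii", "*IIII"]))
print(check_contract(recognizes, ["*XIV"]))
```

An empty list means the contract is honored; the second call reports “*XIV”, because a valid numeral was wrongly declared as a counter-example.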

A simple grammar for unknown words

The two previous grammars (tsar and Roman Numerals) are equivalent to dictionaries, i.e. they recognize and tag a finite set of word forms that could also be listed extensively in a dictionary. NooJ morphological grammars can also represent infinite sets of word forms.

For instance, here is a (rather naïve) grammar that implements a Proper name recognizer:

Figure 14. Naïve Proper Name recognition

The symbol <U> matches any uppercase letter; the symbol <W> matches any lowercase letter. Hence, this graph recognizes all the word forms that begin with an uppercase letter, followed by one or more lowercase letters. For instance, the grammar matches the word forms “Ab” and “John”, but not “abc”, “A” or “INTRODUCTION”. All recognized word forms will be associated with the corresponding linguistic information, i.e. “N+PR” (Noun, Proper Name).

NooJ’s morphological module uses the following special symbols:
<L> any Letter
<U> any Uppercase letter
<W> any loWercase letter
<A> any Accented letter
<N> any uNaccented letter
If necessary, NooJ’s Object Oriented morphological engine would make it easy to add other symbols, such as <V> for vowels, <F> for final letters (for Semitic languages), etc.
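These symbols behave like character classes. The Python sketch below models them as predicates on a single character and uses them to imitate the naïve Proper Name graph (<U> followed by one or more <W>). The mapping to Python's Unicode tests, in particular the accent test based on canonical decomposition, is an approximation chosen for this illustration.

```python
import unicodedata

def is_accented(char):
    # A letter is treated as accented if its canonical decomposition differs from itself.
    return char.isalpha() and unicodedata.normalize("NFD", char) != char

SYMBOLS = {
    "<L>": str.isalpha,
    "<U>": str.isupper,
    "<W>": str.islower,
    "<A>": is_accented,
    "<N>": lambda char: char.isalpha() and not is_accented(char),
}

def proper_name(word):
    """Imitate the graph <U> <W>+ and return its output 'N+PR' on success."""
    if (len(word) >= 2 and SYMBOLS["<U>"](word[0])
            and all(SYMBOLS["<W>"](char) for char in word[1:])):
        return "N+PR"
    return None

for word in ["Ab", "John", "abc", "A", "INTRODUCTION"]:
    print(word, proper_name(word))
```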

The previous grammar recognizes an infinite set of word forms, such as John and The (for instance, when this word form occurs at the beginning of a sentence). Note that when a word form is recognized, it is processed as if it was an actual entry of a dictionary, for instance:



It is best to give morphological grammars that implement productive rules a low priority (see next chapter), to make sure that only word forms that were not recognized by other dictionaries or grammars are analyzed.

Computing lemma and linguistic information

In the previous examples, the lemma associated with the word forms to be analyzed was either explicitly given (e.g. “tsar”), or implicitly identical to the word forms to be recognized. In the same manner, the linguistic information to be associated with the recognized word forms, i.e. “N+Hum” or “A+RN=132” was explicit in the output of a graph.

It is also possible to compute the resulting lemma and/or bits of the resulting linguistic information. To do that, one must explicitly produce the actual linguistic unit as a NooJ annotation (in NooJ, annotations are internally represented between angle brackets “<” and “>”). This explicit notation triggers several functionalities, including the capability of assembling the final annotation along the path, the capability to produce more than one annotation for a word sequence, and the use of variables to store affixes of the recognized sequence.

Consider the following grammar:

Figure 15. Removing the suffix “-ize”

This grammar recognizes any word form that ends with “ize”, “ise” or “ify” (the loop with <L>’s matches any sequence of letters). During the recognition process, the prefix (i.e. the beginning letters) is stored in the variable $Pref.

To store an affix of a sequence in a variable while parsing the sequence, insert the special nodes “$(“ and “$)” around the affix. You need to name the variable: in order to do so, add the variable’s name after the opening parenthesis, e.g. “$(VariableName”. Variables’ nodes appear in red by default.

Recognized word forms are then associated with the corresponding annotation:


in which the variable $Pref is replaced with its value. For instance, when the word form “americanize” is recognized, the variable $Pref stores the value “american”; the word form is then associated with the following annotation:


i.e. as if one of NooJ’s dictionaries contained the following lexical entry:


Similarly, the word form “frenchify” will be tagged as <french,V+RenderA>.

Morphological grammars produce a superlemma that can be used in any of NooJ’s queries or syntactic grammars; for instance, the symbol <american> will now match American and Americans as well as americanize.

Note that unfortunately, the previous “naïve” grammar also matches word forms such as “size”, and then produces an incorrect analysis:


We will see how to fix this problem later.
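The role of the variable $Pref can be imitated with a named group in a regular expression. This Python sketch is only an analogy: the function name is hypothetical, the prefix is assumed to contain at least one letter, and the annotation is built in the same shape as the <french,V+RenderA> example above.

```python
import re

# (?P<Pref>...) plays the role of the variable $Pref: it stores the letters
# that precede the suffix "ize", "ise" or "ify".
IZE = re.compile(r"(?P<Pref>[^\W\d_]+)(?:ize|ise|ify)")

def analyse_ize(word):
    match = IZE.fullmatch(word)
    if match is None:
        return None
    return f"<{match.group('Pref')},V+RenderA>"

for word in ["americanize", "frenchify", "size", "table"]:
    print(word, analyse_ize(word))
```

As in the naïve grammar, “size” is wrongly analysed with the lemma “s”; nothing in the pattern itself checks that the stored prefix is a real word.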

By splitting word forms into several affixes stored in variables, it is possible to process complex morphological phenomena, such as the deletion or addition of letters at the boundaries between affixes. For instance, the following graph could be used to process the French ‘ism’ suffix, when used after the name of a political figure:

Figure 16. Managing affix boundaries

When the word form “Jospinisme” is recognized, variable $Npr stores the prefix “Jospin”. The resulting annotation becomes:


When the word form “Chiraquisme” is recognized, variable $Npr holds the prefix “Chira”. The resulting annotation is:


Notice that the lemma here is produced by concatenating the value of $Npr and a final “c”: $Npr#c. As with dictionary entries, the special code +UNAMB disables any other analyses, so that the word form “Chiraquisme” is not also recognized by the top path (which would produce the incorrect lemma “Chiraqu”).

One word form represents a sequence of more than one annotation

Morphological grammars can also produce more than one annotation for a particular word form. This feature allows one to process contracted words, such as the word form “cannot” or the French contracted word “du”. The following grammar for instance associates “cannot” with a sequence of two annotations:

Figure 17. Contracted words

Note that if this resource is checked in Info > Preferences > Lexical Analysis, the linguistic analysis of texts will annotate the word form cannot as a sequence of two annotations:

Figure 18. Annotation for contracted word “cannot”

The ability to produce more than one annotation for a single word form is essential to the analysis of Asian and Germanic languages.


Variables can also be used in the input of a grammar to check for repetitions. For instance, the following grammar checks for letter repetitions in word forms. Note in the contract that the word forms “efficient” and “grammar” are indeed recognized, whereas the word form “table” is not (the star is used to enter counter-examples in a grammar’s contract). All recognized word forms will be annotated with the category “REP”.

Figure 19. A grammar that recognizes word forms with repeated letters

This grammar can be modified to recognize word forms that include two or more letter repetitions (e.g. “Mississippi”), syllable repetitions (e.g. “barbarian”), letters that occur a certain number of times (e.g. “discussions”), word forms that start and end the same way (e.g. “entertainment”), etc.
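Reusing a variable in the input is comparable to a backreference in a regular expression. The following Python sketch (hypothetical function names, an analogy rather than a description of the graph) tags word forms containing a doubled letter with “REP”, and counts repetitions for the “two or more” variant.

```python
import re

# ([^\W\d_]) stores one letter; \1 requires the same letter immediately afterwards.
DOUBLED = re.compile(r"([^\W\d_])\1")

def tag_repetition(word):
    return "REP" if DOUBLED.search(word) else None

def count_repetitions(word):
    return len(DOUBLED.findall(word))

for word in ["efficient", "grammar", "table", "Mississippi"]:
    print(word, tag_repetition(word), count_repetitions(word))
```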

11.2 Lexical Constraints

The previous grammars can be used when one can describe the exact set of word forms to be recognized, such as the roman numerals, or when the set of word forms is “extremely” productive (i.e. with no exception), such as the proper names.

Indeed, the “generic” type of grammars that include symbols such as <L> has proven to be useful to quickly populate new dictionaries, and to automatically extract from large corpora lists of word forms to be studied and described by lexicographers. For instance, the patterns “<L>* ize” and “re <L>*” can be applied to large corpora in order to extract a set of Verbs.

When using productive grammars, one usually gives them the lowest priority, so that word forms already described in dictionaries are not being re-analyzed; for instance, in the previous grammar, we do not want the word form “Autisme” (which is listed in the French dictionary) to be re-analyzed as a political word derived from the Proper name “Aut”!

In order to control the domain of application of a morphological grammar, it is important to produce constraints on various parts of the word form that have to be checked against NooJ’s lexical resources.

In NooJ, these constraints are parts of the output produced by the morphological grammar, and are also written between angle brackets (“<” and “>”). Note that these are the same constraints that are used in NooJ dictionaries and by the NooJ query and syntactic parsers. For instance, consider the following grammar, similar to the “naïve” grammar used above:

Figure 20. Adding a lexical constraint

Just like the previous grammar, this grammar recognizes word forms such as “americanize” and also “size”. However, for the word form “americanize”, the constraint <$Pref=:A> gets rewritten as <american=:A>; this constraint is validated by a dictionary lookup, i.e. NooJ checks that american is indeed listed as an adjective. On the other hand, for the word form “size”, the constraint <s=:A> does not check because there is no lexical entry “s” in NooJ’s dictionary that is associated with the category “A”. Therefore, only the first analysis is produced.

Lexical constraints can be as complex as necessary; they constitute an important feature of NooJ because they give the morphological module access to the precision of any linguistic information stored in any NooJ dictionary. For instance, one can check all the verbs of a dictionary that derive to an “-able” adjective, and associate them with the morpho-syntactic feature “+able”, as in the following dictionary:



Then, in the following morphological grammar:

Figure 21. Derivation with a complex lexical constraint

the lexical constraint <$Pref=:V+able> ensures that derivations are performed only for verbs that are associated with the feature “+able”. The resulting tag produces an Adjective, but the lemma is the initial Verb. For instance, the previous grammar produces the same analysis for the word form “showable” as the one produced by the following lexical entry:


The word forms “showable” and “laughable” are tagged only because the lexical entries “show” and “laugh” are associated with the category “V” and the feature “+able” in the dictionary.

On the other hand, the word form “table” would not be recognized because the lexical constraint <t=:V+able> does not check: “t” is not a Verb. Similarly, the word forms “sleepable” and “smileable” would not be recognized because the lexical entries “sleep” and “smile” are not associated with the feature “+able”.
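The way a lexical constraint filters analyses can be sketched in Python. The miniature dictionary below is a hypothetical stand-in built from the examples in this section, and the functions only handle a category followed by required features; the result is returned as a (lemma, category) pair rather than as a NooJ tag.

```python
# Hypothetical miniature dictionary: entry -> list of (category, features).
DICTIONARY = {
    "american": [("A", {"Nation"})],
    "french": [("A", {"Nation"})],
    "show": [("V", {"able"})],
    "laugh": [("V", {"able"})],
    "sleep": [("V", set())],
    "smile": [("V", set())],
}

def check(value, constraint):
    """Check a constraint such as 'V+able' against the dictionary entries for value."""
    category, *required = constraint.split("+")
    return any(entry_category == category and set(required) <= features
               for entry_category, features in DICTIONARY.get(value, []))

def analyse_able(word):
    """Analyse X-able as an adjective whose lemma is the verb X, if <X=:V+able> checks."""
    if not word.endswith("able"):
        return None
    pref = word[:-len("able")]
    return (pref, "A") if check(pref, "V+able") else None

for word in ["showable", "laughable", "table", "sleepable"]:
    print(word, analyse_able(word))
print(check("american", "A"), check("s", "A"))
```

The last line mirrors the “-ize” example: the constraint checks for “american” but not for “s”, so “size” receives no analysis.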

Furthermore, lexical constraints can (and should) be used to limit derivations to specific classes of words, e.g. only transitive verbs, only human nouns, or adjectives that belong to a certain distributional class. For instance, consider the following grammar:

Figure 22. Derivation with a complex constraint

It performs derivations only on Adjectives that are associated with the code “+Nation”, such as “american” and “french”, but would not apply to other types of adjectives, such as “big” or “expensive”.


Just as in NooJ’s query symbols, a constraint can contain a number of negative features, such as in:


($Pref must be a Noun, not human and not plural). In the same manner, the right member of the constraint can be negated, such as in:


($Pref must not be a Noun). Finally, the constraint can include a global negation, such as in:


($Pref must not be a human noun). Note that the global negation is equivalent to the right-side negation.

Complex tokenizations

A grammar can produce more than one lexical constraint, so that each component of the word form is associated with a specific constraint.

Being able to tokenize a word form into a series of linguistic units is essential for Asian, Germanic and Semitic languages. For instance, the German word form:


should be associated with a sequence of three annotations such as:

<Schiff,N> <fahren,V+PP> <gesellschaft,N>

In this case, it is essential to make sure that each component of the whole word form is indeed a valid lexical entry. Lexical constraints allow us to enforce that. In NooJ, this tokenization can be performed by the following specific graph:

Figure 23. A complex tokenization in German

Notice that the third (missing) “f” between “Schiff” and “fahren” has to be re-introduced in the resulting tags, and that the extra “s” between “fahrt” and “gesellschaft” is deleted.

Variables, as well as lexical constraints, can be used in embedded graphs, as well as in loops. In this latter case, more than one occurrence of a given variable can be used along the path, with more than one value. In order to make sure that the correct value is used in the corresponding lexical constraint, make sure that each lexical constraint is inside the same loop as the variable, and in its immediate right context. For instance, consider the following grammar:

Figure 24. Lexical Constraints in a loop

If this grammar matches the complex word form “underworkmen”, the variable $Affix will be given three values successively: “under”, “work” and “men”.

Each occurrence of the variable $Affix is followed immediately by a lexical constraint that uses its current value: the variable $Affix is first set to “under”, then the corresponding lexical constraint is <under=:DIC> (DIC matches any word form that is actually listed as a lexical entry); then the variable is set to “work” and the following lexical constraint is <work=:DIC>; then the variable is set to “men” and the lexical constraint is set to <men=:DIC>. Each lexical constraint is followed by an annotation, and the full analysis of the word form produces a sequence of three tokens: <under,PREP><work,N><man,N+p>.
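The loop with one lexical constraint per affix can be imitated by a recursive segmentation in Python. The small table of listed forms is a hypothetical stand-in for NooJ's dictionaries, limited to the three components of the example, and the function name is an assumption.

```python
# Hypothetical stand-in for dictionary lookup: listed word form -> annotation.
LISTED_FORMS = {
    "under": "<under,PREP>",
    "work": "<work,N>",
    "men": "<man,N+p>",
}

def segment(word):
    """Return every way of splitting the word into a sequence of listed forms.
    Each candidate affix must pass the lookup (the <affix=:DIC> constraint)
    before the rest of the word is analysed."""
    if not word:
        return [[]]
    analyses = []
    for end in range(1, len(word) + 1):
        affix = word[:end]
        if affix in LISTED_FORMS:
            for rest in segment(word[end:]):
                analyses.append([LISTED_FORMS[affix]] + rest)
    return analyses

for analysis in segment("underworkmen"):
    print("".join(analysis))
```

Because the constraint is checked right after each affix is stored, a wrong cut such as “underw” is abandoned immediately instead of being carried along the path.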

Transfer of features and properties

NooJ enforces lexical constraints on affixes of the word form by looking up dictionaries. There, these affixes are associated with features and properties that can in turn be transferred to the resulting tag(s). For instance, the word form “reestablished” can be linked to the verbal form “established”. In the dictionary, the verbal form “established” is itself associated with linguistic information, such as “Lemma = establish”, “+t” (transitive), “+PRT” (Preterit), etc.

These properties and features can then be transferred to the tag produced by the transducer, so that the resulting tag for “reestablished” inherits some of the properties of the verbal form “established”. NooJ uses special variables to store the value of the fields of the lexical information associated with each constraint. Lexical constraints (and their variables) are numbered from left to right ($1 being the first lexical constraint produced by the grammar; $2 the second, etc.), and the various fields of the lexicon are named “E” (Entry of the dictionary), “L” (corresponding Lemma), “C” (morpho-syntactic Category), “S” (Syntactic or semantic features) and “F” (inFlectional information). For instance:

$1E = 1st constraint, corresponding lexicon Entry

$1L = 1st constraint, Lemma

$1C = 1st constraint, Category

$1S = 1st constraint, Syntactic features

$1F = 1st constraint, inFlectional features

Now consider the following grammar:

Figure 25. Prefixes and verbs

It recognizes the word form “dismounts” if <$Verb=:V> i.e. if the word form “mounts” is associated with the category “V” (Verb). If this constraint checks, the grammar produces the resulting tag:


In this tag, the variable $1L stores the lemma of “mounts”, i.e. “mount”; $1S stores the syntactic and semantic features for “mounts”, i.e. “+tr” (transitive), and $1F stores the inflectional information for “mounts”, i.e. “+PR+3+s” (PResent, third person, singular). As a result, the word form “dismounts” is annotated as:


Recursive constraints

The previous lexical constraints were enforced simply by looking up the selected dictionaries. It is also possible to write constraints that are enforced recursively, by looking up all lexical and morphological resources, including the current one. For instance, in the following grammar, the lexical constraint <$Verb=:V> checks recursively that the suffix of the current word form is a Verb.

Figure 26. Recursive lexical constraint

When given the word form “reremounts”, the grammar produces the lexical constraint <remounts=:V>, that triggers the morphological analysis of the word form “remounts”. This word form is then analyzed by the same grammar, that produces the lexical constraint <mounts=:V>, that checks OK thanks to a dictionary lookup. The final resulting tag is then:


Notice that the feature +RE is produced twice.

Recursivity is important in Morphology because it allows linguists to describe each prefixation and suffixation independently. For instance, a word form such as “redismountable” can be recognized by independent grammar checks: the “V-Able” grammar produces the constraint <redismount=:V>, then the “re-V” grammar produces the constraint <dismount=:V>, then the “de-V” grammar produces the constraint <mount=:V>, then a final dictionary lookup checks OK.
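The recursive behaviour can be sketched in Python as a function that calls itself on the remainder of the word form. The names TOY_VERBS, PREFIX_FEATURES and analyze_verb are hypothetical, and the layout of the resulting tag is an assumption; the sketch only shows how each prefix contributes its own feature, so that +RE appears twice for “reremounts”.

```python
# Illustrative dictionary: word form -> (lemma, information codes).
TOY_VERBS = {"mounts": ("mount", "V+tr+PR+3+s")}

# Prefix -> feature added to the tag (feature names follow the text).
PREFIX_FEATURES = {"re": "+RE", "dis": "+DE"}

def analyze_verb(word_form):
    """Return tags for word_form, enforcing <$Verb=:V> recursively."""
    tags = []
    if word_form in TOY_VERBS:                    # plain dictionary lookup
        lemma, info = TOY_VERBS[word_form]
        tags.append(f"<{lemma},{info}>")
    for prefix, feature in PREFIX_FEATURES.items():
        rest = word_form[len(prefix):]
        if word_form.startswith(prefix) and rest:
            # The constraint on the remainder is checked by the same analyzer.
            for tag in analyze_verb(rest):
                tags.append(tag[:-1] + feature + ">")
    return tags

print(analyze_verb("reremounts"))
# ['<mount,V+tr+PR+3+s+RE+RE>']
```

Because each prefix is handled independently, adding a new prefix only requires one more entry in PREFIX_FEATURES.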

11.3 Agreement Constraints

Lexical Constraints allow linguists to perform relatively simple checks on affixes: NooJ checks that a certain sequence of letters corresponds to a linguistic unit with such and such a property.

NooJ’s morphological engine has two other operators that can be used to check the equality (or the inequality) of two properties: the equality operator “=” and the inequality operator “!=”. For instance, the following agreement constraint:


checks that the value of property “Number” of the affix stored in variable $Nom is equal to the value of property “Number” of the affix stored in variable $Adj. Conversely, the following agreement constraint:


checks that they differ.

Note that agreement constraints can be used to simulate lexical constraints. For instance, the two following constraints are equivalent if the affix stored in variable $Nom is a noun: <$Nom$Number=”plural”>, <$Nom=:N+plural>
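As a rough illustration, an agreement check amounts to comparing the value of one named property in two sets of lexical properties. In the Python sketch below, the function agree and the dictionaries are hypothetical; in particular, treating a missing property as a failed check is an assumption of this sketch, not a documented NooJ behaviour.

```python
def agree(props_a, props_b, name, negate=False):
    """Compare property `name` of two affixes.

    Returns True if both define the property and the values are equal
    (or different, when negate is True). A missing property fails the check.
    """
    if name not in props_a or name not in props_b:
        return False
    equal = props_a[name] == props_b[name]
    return not equal if negate else equal

nom = {"Category": "N", "Number": "plural"}   # affix stored in $Nom
adj = {"Category": "A", "Number": "plural"}   # affix stored in $Adj

print(agree(nom, adj, "Number"))               # True  (operator "=")
print(agree(nom, adj, "Number", negate=True))  # False (operator "!=")
```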

11.4 The special tag <INFO>

In various cases, some of the linguistic information that needs to be attached to the Atomic Linguistic Unit is produced along the way, not necessarily in the order in which we wish to write the resulting annotation. In that case, we can use the special tag <INFO>, associated with the features and properties to be concatenated at the end of the annotation.

For instance, consider the previous grammar:

Figure 27. Use of the special tag

It recognizes the word forms “remount” as well as “dismounted”. The features “+RE” and “+DE” are produced accordingly, but they will be concatenated at the end of the resulting annotations:



11.5 Inflectional or Derivational analysis with morphological grammars

In Chapter 10, we described NooJ’s inflectional and derivational engine. It is possible to simulate it by using morphological grammars.

In this case, we would use a dictionary with no inflectional or derivational information (i.e. no +FLX or +DRV features), and we would use instead lexical constraints linked to special features in the dictionary. For instance, here would be such a dictionary:




The feature “+Conj3” would then be used in a lexical constraint produced by a grammar such as the graph below.

This graph uses the lexical constraint <$R=:V+Conj3> together with the dictionary above in order to enforce that only Verbs listed in the dictionary, and associated with the conjugation paradigm +Conj3, will be recognized.

For instance, the graph recognizes the word form “helps” (through the path at the bottom): variable $R stores the prefix “help”; NooJ looks up “help”, and verifies that it exists indeed as a lexical entry associated with the information “V+Conj3”. Then, the grammar produces the resulting analysis:


Other, more irregular or complex conjugation schemes would be handled by shortening the root -- which could even be the empty string in the case of highly irregular verbs, such as “to be”.

Figure 28. A morphological grammar that processes conjugation
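The lookup performed by such a grammar can be approximated in Python as follows. This is a sketch under several assumptions: TOY_DIC, CONJ3_SUFFIXES and conjugate_lookup are hypothetical names, the suffix table covers only two endings, and the format of the produced tag is illustrative.

```python
# Dictionary without +FLX/+DRV information; +Conj3 names the paradigm.
TOY_DIC = {"help": "V+Conj3", "table": "N"}

# Suffix -> inflectional codes; a hypothetical fragment of the paradigm.
CONJ3_SUFFIXES = {"s": "+PR+3+s", "ed": "+PRT"}

def conjugate_lookup(word_form):
    """Strip each known suffix and check the root against <$R=:V+Conj3>."""
    tags = []
    for suffix, codes in CONJ3_SUFFIXES.items():
        if not word_form.endswith(suffix):
            continue
        root = word_form[: len(word_form) - len(suffix)]  # value of $R
        if TOY_DIC.get(root) == "V+Conj3":                # lexical constraint
            tags.append(f"<{root},V{codes}>")
    return tags

print(conjugate_lookup("helps"))   # ['<help,V+PR+3+s>']
print(conjugate_lookup("tables"))  # []
```

The second call returns nothing because “table” is listed, but not as a verb of the +Conj3 paradigm.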

Chapter 12. Lexical Parsing

NooJ’s dictionaries and morphological resources can be selected to be used by NooJ’s lexical parser. In order to select a lexical resource, open the window Info > Preferences > Lexical Analysis (make sure that the current language is properly selected), and then check its name, in the upper zone of the window if it is a dictionary, or in the lower zone if it is a morphological grammar:

Figure 29. Select linguistic resources for the lexical parser

Users can select any number of dictionaries and morphological grammars so that NooJ’s lexical parser applies them to texts every time the user performs a linguistic analysis of a text or a corpus.

To apply all selected lexical resources to the current text or corpus, use TEXT > Linguistic Analysis, or CORPUS > Linguistic Analysis.

It is easy to add a lexical resource to NooJ’s pool of lexical resources:

-- in case of a morphological grammar (a file with extension “.nom”), just store it in the Lexical Analysis folder for the corresponding language; typically, the folder will look like:

My Documents\NooJ\en\Lexical Analysis

-- in case of a dictionary (a file with extension “.dic”), compile it by using Lab > Dictionary, and then store the resulting “.nod” file in the Lexical Analysis folder.

12.1 Priority Levels

More than one lexical resource can be selected to be used by NooJ’s lexical parser. If they have the same priority level, NooJ’s lexical parser computes their union, and word forms that are recognized by more than one lexical resource will typically produce more than one analysis.

One can also hide information thanks to a system of prioritization. Each lexical resource is associated with a priority level, that can be either “H” (High), “R” (Regular) or “L” (Low). When parsing a text,

(1) “High Priority” resources are applied to the text first;

(2) then if a word form has not been recognized by the high-level lexical resources, NooJ applies the “regular priority” lexical resources;

(3) then, if both the consultation of “high priority” and “regular priority” lexical resources have not produced any result, NooJ applies the “low priority” lexical resources.

Furthermore, there are degrees in High and Low priority levels, so that “H9” has the highest “High Priority” level, “H1” has the lowest “High Priority” level, “L1” is the highest priority among “Low priority resources” and “L9” is the lowest of all resources.
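The ordering of priority levels and the stop-at-first-success behaviour can be sketched in Python. Everything named here (priority_rank, lookup, the toy dictionaries and their entries) is hypothetical, and the sketch assumes that each degree (H9, H8, ..., L9) forms its own tier that is consulted only if all higher tiers produced nothing.

```python
def priority_rank(level):
    """Map 'H9'..'H1', 'R', 'L1'..'L9' to a rank (smaller = applied first)."""
    if level == "R":
        return 10
    kind, degree = level[0], int(level[1:])
    if kind == "H":
        return 10 - degree      # H9 -> 1, H1 -> 9
    if kind == "L":
        return 10 + degree      # L1 -> 11, L9 -> 19
    raise ValueError(f"unknown priority level: {level}")

def lookup(word_form, resources):
    """resources: list of (level, dict mapping word form -> list of analyses)."""
    by_rank = {}
    for level, dic in resources:
        by_rank.setdefault(priority_rank(level), []).append(dic)
    for rank in sorted(by_rank):
        analyses = []
        for dic in by_rank[rank]:
            analyses.extend(dic.get(word_form, []))  # union within one level
        if analyses:
            return analyses      # lower-priority resources are not consulted
    return []

sdic = {"and": ["and,CONJ", "and,V"], "table": ["table,N"]}  # toy stand-in
filter_dic = {"and": ["and,CONJ"]}
resources = [("R", sdic), ("H1", filter_dic)]

print(lookup("and", resources))    # ['and,CONJ']
print(lookup("table", resources))  # ['table,N']
```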

This system allows users to hide or add linguistic information at will. For example, NooJ’s English dictionary sdic usually has the “default” priority. It describes numerous usages that are not very frequent, for example and = Verb in a technical text, e.g. “we anded the two binary values”. For a less specific application, such as processing standard, non-technical texts (literature or newspapers) in which these usages would never appear, it is useful to create a smaller dictionary with a higher priority than the sdic dictionary, for instance “H1”, in which technical uses are not described, e.g. and is only described as a conjunction. In effect, this small dictionary acts as a filter that hides the unwanted sdic entries.

To give a high priority to a lexical resource, select the resource’s name in the Info > Preferences > Lexical Analysis panel, then click the button “H”. To give a low priority to a lexical resource, select the resource’s name then click the button “L”. To give a regular priority to a lexical resource, select the resource’s name then click the button “R”.

The lexical resources associated with lower priority levels are applied when the application of the other lexical resources has failed. We generally use this mechanism to process unknown word forms. For instance, in the last chapter, the grammar to recognize proper names is applied only to unknown word forms: all simple forms not found in NooJ’s dictionaries and which begin with a capital letter, are identified by this grammar. Similar grammars can be used to recognize productive morphological derivations such as redestalinization, redestructurability, etc.

12.2 Disambiguation using high-priority dictionaries

We can use the system of priorities to eliminate artificial ambiguities. For example, the following multi-word units occur frequently in texts:

as a matter of fact, as soon as possible, as far as I am concerned

All their constituents are autonomous words; therefore, NooJ will systematically suggest two analyses: the multi-word unit (e.g. the adverb as a matter of fact), or the sequence of simple words (e.g. the conjunction as followed by the determiner a, followed by the noun matter, followed by the preposition of, followed by the noun fact). But in truth, these multi-word units are not ambiguous. To avoid producing artificial ambiguities, we store these words in “high-priority” dictionaries, i.e. in dictionaries which are associated with priority levels above NooJ’s standard dictionary sdic.

Note that this mechanism can also be used to adapt NooJ to specific vocabularies of domain languages. For instance, if we know that in a specific corpus, the word “a” will always refer to the determiner (and never to the noun), we could enter the following lexical entry:


in a small, high-level priority dictionary adapted to the corpus.

12.3 Text Annotation Structure

Applying the lexical resources selected in Info > Preferences > Lexical Analysis to a text (TEXT > Linguistic Analysis) builds the Text’s Annotation Structure (TAS), in which all recognized Atomic Linguistic Units (ALUs), whether affixes, simple words or multi-word units, are associated with one or more annotations.

The text’s annotation structure can be displayed via the check box “Show Annotations”. Make sure to resize the lower part of the window in order to give it enough vertical space, so that all ambiguities are shown:

Figure 30. Text Annotation Structure

Annotations and Ambiguities

Note that ambiguities are numerous; at this stage, we have only added lexical and morphological annotations to the TAS; we will see later (in the chapter on Syntactic Analysis) how to remove annotations from the TAS.

Several types of ambiguities can occur:

-- one word form corresponds to more than one lexical entry, in one or more dictionaries. For instance, the word form “being” is either a conjugated form of the lexical entry “be” (the verb to be), or the noun (a being).

-- one word form can correspond to one or more lexical entries, and at the same time, to the result of a morphological analysis. For instance, the word form “retreat” can be a verb form (e.g. the army retreats), or the concatenation of the repetition prefix “re” followed by a verb “treat” (e.g. the computer retreats the query).

-- one sequence of tokens can correspond to one or more multi-word units, and at the same time, to a sequence of simple words. For instance, the text “... round table ...” can be associated with one annotation (the noun meaning “a meeting”), or with two annotations (the adjective, followed by a noun, in the case of a round piece of furniture).

Ambiguities can be managed via NooJ’s lexical resources’ priority system (see above) and with the +UNAMB feature (see below). For instance, if we want a list of technical terms such as “nuclear plant” or “personal computer” never to be parsed as ambiguous with the corresponding sequences of simple words (e.g. the adjective “nuclear” followed by the noun “plant”), we can give the technical terms’ dictionary a higher priority (e.g. “H3”) than NooJ’s dictionary sdic. If a given lexical entry should be processed as is, and disable any other possible analyses, we add the feature “+UNAMB” to its information codes; for instance, to always process the word “adorable” as an adjective (rather than a verb form followed by the suffix “-able”), we enter the word form with the code +UNAMB in a dictionary:


Exporting the list of annotations and unknowns

At the top of the window, in the results zone after “Characters”, “Tokens” and “Digrams”, you can double-click “Annotations” to get a list of all the annotations that are stored in the TAS in a dictionary format. This dictionary can in turn be edited and exported.

In the same manner, double-click “Unknowns” to get a list of all the tokens that have no associated annotation. The “Unknowns” dictionary can be edited, for instance to replace the category “UNKNOWN” with the valid one for each entry. The resulting dictionary can then be compiled (Lab > Dictionary), selected in the Info > Preferences > Lexical Analysis window, and then re-applied to the text (TEXT > Linguistic Analysis) in a few minutes.

Unknowns do not always correspond to errors in the text or in the dictionaries. For instance, the word form “priori” is not a lexical entry, although the multi-word unit “a priori” can be listed as a compound adverb. In consequence, the adverb “a priori” will show up in the “Annotations” window, but the word form “priori”, which is not associated with any annotation, will show up in the “Unknowns” window.

Figure 31. Export the text’s annotations and unknowns as dictionaries

12.4 Special Feature +UNAMB

+UNAMB stands for “Unambiguous”. Inside one given lexical resource (dictionary or morphological grammar), there are cases where some solutions should take precedence over other ones.

For instance, consider the two following lexical entries:

United States of America,N

United States,N

These two entries could be used to recognize all occurrences of both sequences “United States of America” and “United States”. However, consider the following text:

... The United States of America have declared ...

Looking up the previous dictionary will get two matches, because both lexical entries (including the shorter one United States) are indeed found in the text. In other words, the sequence “United States of America” will be analyzed either as one noun, or as a sequence of one noun (“United States”) followed by “of”, followed by “America”.

This problem occurs very often, because most multi-word units are ambiguous, either with sub-sequences of smaller units, or with sequences of simple words. For instance, the term “nuclear plant” will be processed as ambiguous if the selected dictionaries contain the following three lexical entries:

nuclear,A

plant,N

nuclear plant,N

In order to solve these systematic ambiguities, we can add a +UNAMB feature (“Unambiguous”) to the multi-word entries:

nuclear plant,N+UNAMB

United States of America,N+UNAMB

United States,N

When NooJ locates unambiguous lexical entries in the text, it gives priority to them, and thus does not even apply other lexical resources to analyze sub-sequences of the matching sequence. In other words, NooJ will simply ignore the lexical entry “United States” when analyzing the text “United States of America”; in the same manner, it will ignore the lexical entries “nuclear” and “plant” when analyzing the text “nuclear plant”.

However, we still have a problem: let’s add to the previous dictionary the two following entries:



The text “United States of America” is still parsed as one unambiguous Atomic Linguistic Unit (because “United States of America” is marked as +UNAMB); however, the text “United States” will be parsed as ambiguous: either the noun “United States”, or the sequence of two words “United” followed by “States”. In order to get rid of the last ambiguity, we need to mark “United States” as +UNAMB as well. The final dictionary is then:



United States of America,N+UNAMB

United States,N+UNAMB

Now, if several unambiguous lexical entries of varying lengths are recognized at the same position, then only the longest entries will be taken into consideration by the system. In conclusion, both text sequences “United States of America” and “United States” will now be analyzed as unambiguous.

Note finally that it is possible to get ambiguities between lexical entries of the same length; in that case, the +UNAMB feature is used to filter out all the other entries that are not marked as unambiguous. For instance, the following two lexical entries:

round table,N+UNAMB+”meeting”

round table,N+UNAMB+”knights of the...”

are both tagged +UNAMB in order to inhibit the analysis “round followed by table”. However, both analyses “meeting” and “knights of the...” will be produced by NooJ’s lexical parser.
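The selection rule for +UNAMB entries can be summarized in a short Python sketch. The names DIC, matches_at and filter_unamb are hypothetical, the semantic labels of the “round table” entries are simplified, and the sketch only filters the matches that start at one given position; it does not model how NooJ skips the sub-sequences inside a matched unit.

```python
DIC = [
    ("United States of America", "N+UNAMB"),
    ("United States", "N+UNAMB"),
    ("nuclear", "A"),
    ("plant", "N"),
    ("nuclear plant", "N+UNAMB"),
    ("round", "A"),
    ("round table", "N+UNAMB+meeting"),
    ("round table", "N+UNAMB+knights"),
]

def matches_at(tokens, pos, dictionary=DIC):
    """Entries whose words match the tokens starting at position pos."""
    found = []
    for entry, info in dictionary:
        words = entry.split()
        if tokens[pos:pos + len(words)] == words:
            found.append((entry, info))
    return found

def filter_unamb(matches):
    """If any match is +UNAMB, keep only the longest +UNAMB matches."""
    unamb = [m for m in matches if "+UNAMB" in m[1]]
    if not unamb:
        return matches
    longest = max(len(entry.split()) for entry, _ in unamb)
    return [m for m in unamb if len(m[0].split()) == longest]

text = "The United States of America have declared".split()
print(filter_unamb(matches_at(text, 1)))
# [('United States of America', 'N+UNAMB')]
print(filter_unamb(matches_at(["round", "table"], 0)))
# [('round table', 'N+UNAMB+meeting'), ('round table', 'N+UNAMB+knights')]
```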

12.5 Special Feature +NW

+NW stands for “non-word”. This feature can be added to lexical entries that are not supposed to be valid outputs of NooJ’s lexical parser.

Why, then, would one want to add a lexical entry to a dictionary, and then associate it with the +NW feature? For Romance languages and English, the tradition is to describe words by entering their lemmas in dictionaries (e.g. verbs are represented by their infinitive form). But we could also represent words by their stems or roots. For instance, consider the following dictionary entry for the verb “to have”:


and then use the following morphological grammar to analyze all conjugated forms of “to have”:

Figure 32. Morphological Analysis from a stem

This morphological grammar recognizes the word forms have, has, having and had. Each of these forms will be computed from the root “ha”, which needs to be listed as a verb in a dictionary, as required by the lexical constraint <$Root=:V>. But the word form “ha” itself must not be recognized as a valid lexical entry from the dictionary. Hence, we associate the lexical entry “ha” with the feature +NW (non-word):


This functionality allows linguists to enter “non-words” in their dictionaries; these non-words are used by NooJ’s morphological parser, but they will never be produced as plain annotations in a Text Annotation Structure. Note that NooJ’s Hungarian module contains indeed a dictionary of “non-word” stems, which are associated with a series of morphological grammars.
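The effect of +NW can be imitated in Python: the stem is visible to the morphological check, but a direct lookup of the stem produces no annotation. In this sketch, TOY_DIC, SUFFIXES, has_feature and analyze are hypothetical names, and both the stored information “V+NW” and the output tag “<have,V>” are simplifications chosen for the illustration.

```python
TOY_DIC = {"ha": "V+NW"}              # the stem, flagged as a non-word
SUFFIXES = ("ve", "s", "ving", "d")   # have, has, having, had

def has_feature(info, feature):
    """True if `feature` is among the +features following the category."""
    return feature in info.split("+")[1:]

def analyze(word_form):
    annotations = []
    info = TOY_DIC.get(word_form)
    # Direct lookup: entries carrying +NW never become annotations.
    if info is not None and not has_feature(info, "NW"):
        annotations.append(f"<{word_form},{info}>")
    # Morphological analysis: the stem must satisfy <$Root=:V>.
    for suffix in SUFFIXES:
        if word_form.endswith(suffix):
            root_info = TOY_DIC.get(word_form[: -len(suffix)])
            if root_info is not None and root_info.split("+")[0] == "V":
                annotations.append("<have,V>")
    return annotations

print(analyze("has"))  # ['<have,V>']
print(analyze("ha"))   # []
```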

12.6 Special Category NW

The special category NW is used in a slightly different manner from the special feature +NW. It too must be associated with “non-words” or “artifacts”, i.e. lexical entries that must not result in real annotations, and thus do not show up in the Text Annotation Structure. The difference between these two codes is that lexical entries associated with the NW category are still visible to NooJ’s Syntactic Parser, whereas lexical entries associated with the feature +NW are not.

In consequence, the NW category can be used to describe components of expressions that can be parsed by NooJ’s syntactic parser, but that do not occur as autonomous lexical entries. For instance, consider the following entries:




These lexical entries will never produce lexical annotations, and are not displayed as annotations in the Text Annotation Structure. However, they can still be managed by NooJ’s syntactic parser. For instance, the following syntactic grammar:

Figure 33. Latin Adverbs

will recognize the sequence “a priori” (“a” followed by a non-word <NW>), and then will annotate it as an adverb (<ADV+LATIN>), even though the component “priori” will not be annotated.
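The distinction can be sketched in Python with two functions reading the same toy lexicon: one plays the role of the lexical parser and never outputs NW entries, the other plays the role of the syntactic grammar and can still see them. The names LEXICON, lexical_annotations and latin_adverbs, as well as the entries themselves, are assumptions made for this illustration.

```python
# Toy lexicon: token -> list of categories (hypothetical entries).
LEXICON = {"a": ["DET"], "priori": ["NW"], "posteriori": ["NW"]}

def lexical_annotations(tokens):
    """Annotations shown in the TAS: entries of category NW are left out."""
    return [
        (i, i + 1, cat)
        for i, tok in enumerate(tokens)
        for cat in LEXICON.get(tok, [])
        if cat != "NW"
    ]

def latin_adverbs(tokens):
    """Syntactic rule: "a" followed by <NW> is annotated as ADV+LATIN."""
    annotations = []
    for i in range(len(tokens) - 1):
        if tokens[i] == "a" and "NW" in LEXICON.get(tokens[i + 1], []):
            annotations.append((i, i + 2, "ADV+LATIN"))
    return annotations

tokens = "a priori".split()
print(lexical_annotations(tokens))  # [(0, 1, 'DET')]
print(latin_adverbs(tokens))        # [(0, 2, 'ADV+LATIN')]
```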

Topic revision: r2 - 2009-08-11 - MaxSilberztein