REGULAR EXPRESSIONS AND GRAMMARS

The second part shows you how to carry out complex searches in texts with regular expressions (chap. 4), how to use lexical resources for linguistic queries (chap. 5), and how to use NooJ’s grammar editor to describe more powerful queries (chap. 6).

Chapter 4. Regular Expressions

4.1. Disjunction

Load the text file “_Portrait of a lady” ( File > Open > Text). Display the locate window ( Text > Locate). Now type in the following NooJ regular expression (spaces are optional):

never + perhaps

In NooJ, the disjunction operator (also known as the UNION, or the “or”) is symbolized by the “+” character.


Here, the disjunction operator tells NooJ to locate all of the occurrences of “never” or “perhaps” in the text.

Make sure there is no limitation to the search: select “All matches” in the Limitation zone, at the bottom of the Locate panel. Since these adverbs are very frequent, we are expecting a high number of matches. If we had left the option “Only 100 matches”, the search would have been limited to the first 100 matches.

Figure 1. Enter a NooJ regular expression

Now click one of the colored buttons at the bottom of the Locate panel: the search is launched. After a few seconds, you should get a concordance with 696 entries. Click on the “After” column header in order to sort the concordance entries according to their right context (if you are interested in what occurs after these adverbs).

Figure 2. Concordance for the expression: never+perhaps

4.2. Parentheses

We want to locate the sequences made up of the word “her” or “his”, followed by the word “voice”. To do this, display the locate window ( Text > Locate), then enter the following regular expression:

(her + his) voice

Click on a colored button (not the color you already selected for the previous concordance). NooJ should find 19 occurrences of the sequence, her voice or his voice. Launch the search once more but this time do not use any parentheses:

her + his voice

This time, NooJ recognized 4,495 occurrences:

Figure 3. Forgotten parentheses

What happened? NooJ has indexed two sequences: “her” and “his voice”. The blank space, called a concatenation operator, used here between the words his and voice, takes priority over the “or” operator “+”.


In the former regular expression, the parentheses were used to modify the order of priorities, so that the scope of the “or” (the disjunction operator) is limited to her or his.

In regular expressions, blanks (also named concatenation operators), have priority over the disjunction operator. Parentheses are used to modify the order of priority.
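The same precedence rule exists in most regular-expression dialects. As a rough analogy only (Python's re module is not NooJ: it writes the disjunction as "|" and matches characters rather than word forms), the sketch below reproduces the effect of forgetting the parentheses. The sample sentence is invented for illustration.

```python
import re

# Invented sample sentence (not taken from the novel).
sample = "She heard her voice, then his voice, and took her hand."

# Without parentheses: "her" OR "his voice" (concatenation binds tighter).
without_parens = re.findall(r"\bher\b|\bhis voice\b", sample)

# With parentheses: ("her" OR "his") followed by "voice".
with_parens = re.findall(r"\b(?:her|his) voice\b", sample)

print(without_parens)
print(with_parens)
```

As in the NooJ example, the unparenthesized pattern returns every isolated “her” in addition to “his voice”.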

4.3. Sets of forms

We will now locate all of the occurrences of the verb to be. In the Text menu, click on Locate to reload the locate window. Select the option “a NooJ regular expression”, then type in (A):

am+are+is+was+were

In the lower left hand corner (B), under Limitation, make sure the radio button “All matches” is selected, then click on a colored button to launch the search.

Figure 4. Locate a set of forms

The disjunction operator allows you to undertake several searches at a time; in this example, the forms are all inflectional forms of the same word, but one could also locate spelling variations, such as:

csar + czar + tsar + tzar

names and their variations, such as:

New York City + Big apple + the city

terminological variants:

camcorder + video camera

morphologically derived forms:

Stalin + stalinist + stalinism + destalinization

Or expressions, terms or forms that represent similar concepts:

(credit + debit + ATM + visa) Card + Mastercard

Disjunctions therefore turn regular expressions into a powerful tool to extract information from texts.

4.4. Using lower-case and upper-case in regular expressions

In a regular expression, a word written in lower-case recognizes all of its variations in a text. The following expression, for example:

it

also recognizes the four word forms:

IT, It, it, iT

On the other hand, a form that contains at least one upper-case letter in a regular expression will only recognize identical forms in texts; for example:

It

will recognize only the form “It”. If you want to recognize the form “it” only when it is written in lower-case, use the quotation marks:

“it”

will recognize only the form “it”.
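The three case rules above can be summarized in a few lines of code. This is only a sketch of the rules as stated in this section, not NooJ's implementation; the function name form_matches is hypothetical, and the quotation marks are assumed to be typed as straight double quotes.

```python
def form_matches(query: str, form: str) -> bool:
    """Return True if a text word form matches a query word (rules of section 4.4)."""
    # Rule 3: a quoted query matches only the identical form.
    if len(query) >= 2 and query[0] == '"' and query[-1] == '"':
        return form == query[1:-1]
    # Rule 1: an all-lower-case query matches every case variant.
    if query == query.lower():
        return form.lower() == query
    # Rule 2: a query with at least one upper-case letter matches exactly.
    return form == query


for form in ["IT", "It", "it", "iT"]:
    print(form, form_matches("it", form), form_matches("It", form), form_matches('"it"', form))
```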

4.5. Exercises

Study the use of the word girl in the novel “The portrait of a lady”. How many times is this word used in the plural? In how many different multi-word units does this word form occur?

How many times does the form death occur in the text? In how many idiomatic or metaphoric expressions?

Study the use of the word forms up and down: how many times do these word forms correspond to a particle? How many times do they correspond to a preposition?

Locate in the text all occurrences of names of days ( Monday ... Sunday).

4.6. Special symbols

The following regular expression:

(the + a) <WF> is

finds all of the sequences made up of the word form “the” or “a”, followed by any word form (<WF>), followed by the form “is”.

All NooJ special symbols are written between angle brackets “<” and “>”. Do not leave any blank space between the angle brackets, and respect the case (upper-case for the symbol <WF>). If you apply exactly the former expression to the text, you should obtain a concordance that looks like the one below.

Figure 5. Apply a query with a special symbol

Note the importance of the angles; the following regular expression:

(the + a) WF is

represents the two literal sequences “the WF is” and “a WF is”. There isn’t much chance of you finding that sequence in this text…


IMPORTANT: <WF> is a special symbol. In NooJ, all special symbols are written between angle brackets.

Following is a list of NooJ symbols, as well as their meaning:

Special Symbol   Meaning

<WF>             word form (sequence of letters)
<L>              word form, with length 1
<LOW>            word form in lower-case (sequence of lower-case letters)
<W>              word form in lower-case, with length 1
<UPP>            word form in upper-case (sequence of upper-case letters)
<U>              word form in upper-case, with length 1
<CAP>            word form in capital (an upper-case letter followed by lower-case letters)
<NB>             sequence of digits
<D>              one digit
<P>              delimiter (one character)
<^>              beginning of a text unit
<$>              end of a text unit
<V>              a vowel
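To make the case-related symbols concrete, here is a minimal sketch that decides which of them a given token would match. It covers only <WF>, <LOW>, <UPP>, <CAP> and <NB>, follows the wording of the list above, and is not NooJ's own tokenizer; the function name symbols_for is hypothetical.

```python
def symbols_for(token: str) -> set:
    """Return the special symbols (subset) that a single token would match."""
    symbols = set()
    if token.isalpha():
        symbols.add("<WF>")                      # any sequence of letters
        if token.islower():
            symbols.add("<LOW>")                 # only lower-case letters
        if token.isupper():
            symbols.add("<UPP>")                 # only upper-case letters
        if token[0].isupper() and token[1:].islower():
            symbols.add("<CAP>")                 # one upper-case letter, then lower-case
    elif token.isdigit():
        symbols.add("<NB>")                      # sequence of digits
    return symbols


for token in ["Isabel", "IT", "never", "1985", ","]:
    print(token, sorted(symbols_for(token)))
```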

Following are a few expressions that contain special symbols:

We are looking for all the paragraphs that start with a word form in capital, followed by a form in lower-case, and then a comma:

<^> <CAP> <LOW> ,

Apply this query to the text “The portrait of a lady”: you should get 155 matches.

Now we want to locate the word forms in capital that occur at the beginning of paragraphs, or after a comma, and are followed by the word form “said”:

(<^> + ,) <CAP> said

(there are 6 occurrences in the text). Now we want to locate all sequences of two consecutive forms written in upper-case letters:

<UPP> <UPP>

Figure 6. Search for a sequence of two upper-case forms

Now locate the word forms that occur between “at” or “in” and “of” (there should be 142 matches):

(at + in) <WF> of

4.7. Special characters

The blank

NooJ processes any sequence of spaces, tabulation characters, and line change characters (codes “NEW LINE” and “CAR RET”) as blanks. When entering a regular expression, blanks are usually irrelevant and are therefore optional.

Generally, one does not search for blanks:

In morphology, the range of the search is limited to the word form, in which there is no blank, by definition;

In syntax, blanks are always implicit; the expression <WF><WF>, for example, recognizes any sequence of two word forms (that are naturally always separated by a blank).

The following expression, for example:

<NB> ,

recognizes any sequence of consecutive digits that are directly followed by a comma, but also those that are followed by a blank (in NooJ terms, i.e., any sequence of spaces, line changes, or tab characters). Both of the following sequences are recognized by the previous expression:

1985,

1734 ,

Double quotes

However, it is sometimes necessary to specify “mandatory” blanks; in which case, we can use the double quotes to make the space explicit. The following is a valid regular expression:

<NB> “ ” ,

that recognizes only digit sequences that are followed by at least one space and a comma. Note that between the space and the comma, there might be extra spaces.

More generally, double quotes are used in NooJ to protect any sequence of characters that would otherwise have a particular meaning in the writing of a regular expression (or, as we will later discover, of a tag in a graph). For example, if we want to locate in a text all the single word forms in parentheses, we would enter the expression:

“(” <WF> “)”

Similarly, if we want to locate uses of the character “+” between numbers:

<NB> “+” <NB>

Double quotes are not useful in the following case: the expression in the second line is simpler and equivalent to the first line:

“1234” “&” “XVII” “.”

1234 & XVII .

Double quotes are used to perform exact matches. Note that the following two expressions are not equivalent:

“is” “A:B”

is A:B

“is” in the first expression only recognizes the lower-case word form “is”, not “IS” nor “Is”; “A:B” does not recognize the variations with a space such as “A : B”.

The sharp character “#”

The sharp character is used to forbid the use of a space. For example, when locating decimal numbers with a comma (and to avoid confusion with the use of the comma as punctuation), one could use the following expression:

<NB> # , # <NB>

How do we enter the query: “a sequence of digits followed by exactly one space, and then a comma”? The following regular expression can be used:

<NB> “ ” # ,

The sharp character (“#”) matches if and only if there is no blank at the current position in the text. Note that the following regular expression will never recognize anything:

<NB> # “ ”

because if right after the sequence of digits, there is no blank, then the “ ” will never match anything.
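The behaviour of implicit blanks, quoted blanks and the sharp character can be approximated with ordinary character-level patterns. The table below is only an analogy written with Python's re module (an assumption made for illustration; NooJ does not use these patterns internally), and the helper name recognizes is hypothetical.

```python
import re

# NooJ expression (as described above) -> approximate Python analogue.
ANALOGUES = {
    '<NB> ,':          r"\d+\s*,",   # blank before the comma is optional
    '<NB> " " ,':      r"\d+ \s*,",  # at least one space, extra blanks allowed
    '<NB> " " # ,':    r"\d+ ,",     # exactly one space before the comma
    '<NB> # , # <NB>': r"\d+,\d+",   # no blank on either side of the comma
}


def recognizes(nooj_expression: str, text: str) -> bool:
    """Check whether the analogue of the expression matches the whole text."""
    return re.fullmatch(ANALOGUES[nooj_expression], text) is not None


print(recognizes('<NB> ,', "1985,"), recognizes('<NB> ,', "1734 ,"))
print(recognizes('<NB> " " # ,', "1734 ,"), recognizes('<NB> " " # ,', "1734  ,"))
```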

4.8. The empty string “<E>”

The <E> special symbol represents the empty string, in other words the neutral element of the concatenation operation. It is generally used to note an optional or elided element. For example, to represent the two variants:

a credit card + a card

One can use the following, more compact version:

a (credit + <E>) card

Similarly, to locate the occurrences of the form “is” followed, within a context of two words, by “the”, “this” or “that”, one can use the following expression:

is ((the+this+that) + <WF> (the+this+that) +
<WF> <WF> (the+this+that))

But the following expression is in general more compact and legible:

is (<E> + <WF> + <WF> <WF>) (the+this+that)

4.9. The Kleene operator

The Kleene operator is used to indicate any number of occurrences. For example, if one is locating the matches for the form “is” followed by any number of word forms, followed by the word form “the”, the following expression would be used:

is <WF>* the

Note that the number of forms is unlimited, and includes zero: the previous expression is equivalent to the following infinite expression:

is (<E> + <WF> + <WF> <WF> + <WF> <WF> <WF> + ...) the

In the same manner, the following expression:

the very* big house

recognizes an unlimited number of sequences:

the big house, the very big house, the very very big house,
the very very very big house, the very very very very big house,...

When using the Kleene operator to specify an insertion of unlimited length, be careful not to forget potential delimiters. For example, to recognize the sequences made up of the word form “is”, then of any possible insertion, then of the word form “by”, you should enter the expression:

is (<WF> + <P>)* by

Figure 7. Arbitrary sequences in a pattern

(Note that you can change the length of the left and right contexts in the concordance).

We will see later that regular expressions are also used in NooJ’s inflectional and derivational morphology module.


Summary:

You have learned to write a few elementary syntactic regular expressions:

-- the blank (concatenation operator) allows you to build sequences of words;

-- the “+” (disjunction operator) allows you to select alternate sequences;

-- the “*” (Kleene operator) is used to mark unlimited repetitions.
-- the <E> symbol (the empty string) is the neutral element for the concatenation.


The Kleene operator takes priority over concatenation, which takes priority over disjunction (“or” operator). Parentheses can be used to change the order of priority.
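The priority order stated above can be made explicit with a small recursive-descent translator. The sketch below is an illustration only, not NooJ's engine: it handles plain words, <WF>, <E>, “+”, “*” and parentheses, ignores the case rules and delimiters, and converts a query into a Python regular expression applied to the words of a text joined by single spaces. The names to_python_regex and locate are hypothetical.

```python
import re


def to_python_regex(query):
    """Translate a small subset of the query syntax into a Python regex.

    Each word is emitted with one trailing space, so plain regex
    concatenation plays the role of the blank operator.
    """
    tokens = re.findall(r"<[^>]+>|[()+*]|[^\s()+*<>]+", query)
    pos = 0

    def disjunction():              # lowest priority: sequence + sequence + ...
        nonlocal pos
        branches = [sequence()]
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            branches.append(sequence())
        return "(?:" + "|".join(branches) + ")"

    def sequence():                 # middle priority: factor factor ...
        parts = []
        while pos < len(tokens) and tokens[pos] not in ("+", ")"):
            parts.append(factor())
        return "".join(parts)

    def factor():                   # highest priority: atom with optional *
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            atom = disjunction()
            pos += 1                # skip the closing parenthesis
        elif token == "<E>":
            atom = ""               # empty string: neutral element
        elif token == "<WF>":
            atom = "(?:[A-Za-z]+ )"
        else:
            atom = "(?:" + re.escape(token) + " )"
        if pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            atom = "(?:" + atom + ")*"
        return atom

    return disjunction()


def locate(query, text):
    """Return the word sequences of text recognized by the query."""
    stream = " ".join(re.findall(r"[A-Za-z]+", text)) + " "
    pattern = r"(?:(?<= )|^)" + to_python_regex(query)
    return [m.group().strip() for m in re.finditer(pattern, stream) if m.group()]


print(locate("her + his voice", "her hand and his voice"))
print(locate("(her + his) voice", "her hand and his voice"))
print(locate("the very* big house", "the very very big house and the big house"))
```

Each parsing function corresponds to one priority level, which is why the parentheses in the second query change the result.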

Chapter 5. Using lexical resources in Grammars

5.1. Using lexical resources in Grammars

Previously, we located the conjugated forms of the verb to be by using the following expression:

am + are + is + was + were

We could also add the following forms to the expression:

be + been + being

While the resulting expression would be perfectly valid, that would certainly be very tedious. Fortunately, it is possible to use lexical information to greatly simplify this type of queries.

For each language, NooJ accesses a dictionary (further described later) in which each word of that language is an entry, and is associated with some morphological information, usually its inflectional and/or derivational paradigms. The inflectional paradigm tells NooJ what inflected forms the lexical entry accepts, i.e. what are its conjugated forms (if it is a verb), its feminine and plural forms (for nouns in Romance languages), its accusative, dative, genitive etc. forms (for Germanic languages), etc. The derivational paradigm tells NooJ what derived forms the lexical entry accepts, i.e. what words can be derived from the entry. For instance, from the noun “color”, we can construct the verb “to colorize”; from the verb “to drink”, we can construct the adjective “drinkable”, etc.

Thanks to this dictionary, NooJ can link all inflected or derived forms together. All these forms are then stored in an equivalence set. We access this equivalence set simply by entering any member of the set between angle brackets. For instance, the following expression, in which we refer to the word form “be”, represents all of the inflected forms of “to be”, in which we are interested.

<be>

Load the text “_Portrait of a lady” ( File > Open > Text). Display the locate window ( Text > Locate). Enter this regular expression in the text’s “Locate” panel. Type it in exactly as it is above: do not confuse the angles “<” and “>” with the brackets “[” and “]”; do not insert any spaces; make sure that you type be in lower-case. Apply this expression without any limitations to the text “The Portrait Of A Lady”. NooJ should find 7,484 occurrences. In the same manner, re-launch the search by using the symbol:

<was>

NooJ should find the same 7,484 occurrences as previously, because the two word forms “be” and “was” belong to the same equivalence set.

Always remember to write the angles! Re-launch the search, but without typing in the angles:

be

This time, NooJ only locates the occurrences of the literal word form “be” (only 1,366 occurrences).


In a regular expression, when a form is written as is (e.g. be), NooJ locates the occurrences of the word form itself. On the other hand, when the word form is set between angle brackets, NooJ locates all of the word forms that are in the same equivalence set as the given word form (generally all inflected forms, derived forms or spelling variants of a given lexical entry).
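The notion of equivalence set can be sketched with a toy lookup table. The lexicon below is invented for illustration (it is not NooJ's dictionary, and NooJ's actual data structures are not described here); the names LEXICON, equivalence_set and locate_symbol are hypothetical.

```python
# Toy lexicon, invented for illustration: word form -> lexical entry (lemma).
LEXICON = {
    "be": "be", "am": "be", "are": "be", "is": "be", "was": "be",
    "were": "be", "been": "be", "being": "be",
    "have": "have", "has": "have", "had": "have", "having": "have",
}


def equivalence_set(form):
    """All word forms linked to the same lexical entry as the given form."""
    lemma = LEXICON[form.lower()]
    return {f for f, entry in LEXICON.items() if entry == lemma}


def locate_symbol(form, words):
    """Emulate a query such as <be>: match any member of the equivalence set."""
    members = equivalence_set(form)
    return [w for w in words if w.lower() in members]


words = "She was sure they were to be happy".split()
print(locate_symbol("be", words))    # query <be>
print(locate_symbol("was", words))   # query <was>: same set, same matches
print([w for w in words if w == "be"])  # query be (no angles): literal form only
```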

5.2. Indexing a category

In NooJ’s dictionaries, all entries are associated to a morpho-syntactic category. We may then refer to this category in regular expressions. For example, to locate all of the sequences containing any form associated with the lemma “be”, followed by a preposition, then a noun, enter the following expression:

<be> <PREP> <N>

(In NooJ’s English dictionary, PREP stands for Preposition, and N denotes Noun). Launch the search; NooJ should show 373 sequences.

Figure 8. Use lexical information in regular expressions

NooJ will locate all of the sequences made up of any word form in the same equivalence set as “be”, followed by any word form associated with the “PREP” category, followed by any word form associated with the “N” category.

The following symbols are references for the codes found in NooJ’s English dictionary:

Code    Meaning                     Examples

A       Adjective                   artistic, blue
ADV     Adverb                      suddenly, slowly
CONJC   Coordination conjunction    and
CONJS   Subordination conjunction   if, however
DET     Determiner                  this, the, my
INT     Interjection                ouch, damn
N       Noun (substantive)          apple, tree
PREP    Preposition                 of, from
PRO     Pronoun                     me, you
V       Verb                        eat, sleep


Note: these codes are not set by NooJ itself, but rather by its dictionary. NooJ does not know what the symbol “ADV” means: in order to recognize the special symbol <ADV>, NooJ consults its dictionaries, and verifies if the word is therein associated with the code ADV.

In other words, linguists and lexicographers who are using NooJ are totally free to invent their own categories and codes (e.g. DATE, VIRUS or POLITICS).


Important: Users may add their own codes to the system, either in new, personal dictionaries or by modifying the system’s dictionaries. The new codes must always be written in upper-case. They are immediately available for any query, and may be instantly used in any regular expression or grammar (just write them between angle brackets).

Before adding new codes to the system, you should verify that they do not conflict with codes used in other dictionaries. For example, do not enter a list of professions with the code <PRO> if you plan to use NooJ’s default dictionaries, because this code is already used for the pronouns.

Conversely, if you add a list of terms that have the function of a substantive, it is preferable to code them “N” rather than, say, “SUBS”, so that queries and grammars you might want to write may access all nouns with one single symbol.

We will now locate the sequences of the verb “to be”, followed by an optional adverb, a preposition, then the determiner “the”. Reactivate the Locate window and enter the following expression:

<be> (<ADV> + <E>) <PREP> the

Launch the search; NooJ should find the corresponding sequences.

Figure 9. Another regular expression

5.3. Combining lexical information in symbols

In NooJ’s dictionaries, entries are associated with at least one morpho-syntactic code. They may also be described with other types of information, and all of the information available in these dictionaries may be used in queries or in grammars. For example, here is an entry from the English dictionary:

virus,N+Conc+Medic

This entry states that the word “virus” is a noun (N), belonging to the distributional class Concrete (Conc) and is used in the semantic domain Medical (Medic).

Syntactic and semantic information

All pieces of information in NooJ are represented by codes prefixed with the character “+”.


Warning: do not confuse the “+” character in dictionaries and the disjunction operator in regular expressions.

One can use these codes in queries, to the right of a word form or of a category. For example, <fly+tr> could denote transitive uses of the verb to fly, and <N+Medic> represents all the nouns that are associated with the medical semantic domain.

Symbols in queries can include negations. For instance, <fly-tr> would denote non-transitive uses of the verb to fly, and <N-Medic> represents all the nouns that are not associated with the medical semantic domain.

One can combine these codes as much as needed. For example <N+Hum-Pol> represents human (+Hum) nouns that do not belong to the semantic domain “Politics”. Codes are not ordered: for instance, the previous symbol is equivalent to <N-Pol+Hum>.

Warning: Codes are case sensitive. For example, the codes “+Hum”, “+hum”, “+HUM” would represent three different codes to NooJ, and the symbol <N+Hum> does not match a lexical entry associated with the code “+HUM” or “+hum”.
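The matching rules for combined codes (unordered, case-sensitive, with “+” for required and “-” for excluded properties) can be sketched as follows. Only the entry for “virus” comes from this chapter; the other two entries, and the names parse_symbol and matches, are hypothetical and serve only as illustration.

```python
import re


def parse_symbol(symbol):
    """Split "<N+Hum-Pol>" into ("N", {"Hum"}, {"Pol"})."""
    parts = re.findall(r"[+-]?[^+-]+", symbol.strip("<>"))
    required = {p[1:] for p in parts[1:] if p[0] == "+"}
    forbidden = {p[1:] for p in parts[1:] if p[0] == "-"}
    return parts[0], required, forbidden


def matches(symbol, entry):
    """Check a symbol against a dictionary line such as "virus,N+Conc+Medic"."""
    lemma, codes = entry.split(",")
    category, *features = codes.split("+")
    head, required, forbidden = parse_symbol(symbol)
    if head not in (category, lemma):       # head is a category or a lemma
        return False
    # Set operations make the codes unordered; string equality keeps them case-sensitive.
    return required <= set(features) and not (forbidden & set(features))


virus = "virus,N+Conc+Medic"        # entry quoted in this chapter
senator = "senator,N+Hum+Pol"       # hypothetical entry
doctor = "doctor,N+Hum+Medic"       # hypothetical entry

print(matches("<N+Medic>", virus), matches("<N-Medic>", virus))
print(matches("<N+Hum-Pol>", senator), matches("<N+Hum-Pol>", doctor))
print(matches("<N-Pol+Hum>", doctor), matches("<N+HUM>", doctor))
```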

Inflexional Information

In NooJ, any piece of lexical information (including inflectional codes) is encoded the same way, i.e. with the prefix character “+”. However, inflectional codes are not, usually, visible in NooJ’s dictionaries, because NooJ’s dictionaries contain lemmas, rather than conjugated forms. We will see later that inflectional codes are described in the inflectional-derivational description files (.nof files).

However, it is important to know what these codes are, because they can be used (queried) exactly like syntactic or semantic codes. Here are the inflectional codes that are used in NooJ’s English dictionary:

Both in symbols and in dictionaries, information codes are not ordered. For example, <V+PR+3+s> and <V+3+PR+s> match the same occurrences. NooJ allows partial queries: for example, <be+PR> represents all of the forms of the verb to be conjugated in the present tense, and <be+3+s> matches both forms “is” and “was”. Below are the inflectional codes that are used in NooJ’s French dictionary:

Code      Meaning

s         Singular
p         Plural
1, 2, 3   1st, 2nd, 3rd person
PR        Present indicative
F         Future
I         Imperfect
PS        Simple past (passé simple)
S         Present subjunctive
IP        Present imperative
C         Present conditional
PP        Past participle
G         Present participle
INF       Infinitive

It is possible to combine queries on a word and on a category: for example, the symbol <admit,V+PR> matches all the forms of the verb “admit” conjugated in the present. We will see that these complex queries are useful when a text has been partially (or totally) disambiguated.

5.4. Negation

NooJ processes two levels of negation in symbols used in regular expressions:

-- as we have just seen, one can prefix any of the properties of a lexical entry with the character “-” instead of the character “+”; in that case, only word forms that are not associated with the feature will match; for instance, <N-Hum> matches non-human nouns.

-- another, more global negation: one can match all the word forms that do not match a given lexical symbol, by prefixing the symbol with the character “!”. For instance, <!V> matches all the word forms that are not annotated as verbs; <!have> matches all the word forms that are not annotated with the lemma “have”; <!N+Hum+p> matches all the word forms that are not annotated as plural human noun.


Warning: negations often appear to produce obscure, unexpected results in NooJ, because of the huge level of ambiguity that is produced by dictionaries.

For instance, consider the following untagged text:

I left his address on the table

the query <!V> would match all the word forms in the previous sentence, including “left”, because this form is also associated with the lexical entry left = Adjective, therefore “left” can be a non-verb.

<!N> also matches all the forms, including “address” and “table”, because both forms are also associated with lexical entries that are not nouns (the verbs “to address” and “to table”).

As a consequence, I strongly suggest restricting the use of negation to queries that are applied to texts after they have been disambiguated. We will see later how to perform disambiguation with NooJ. As a matter of fact, all these problems disappear if one works with the same text after it has been tagged in the following way:

I {left,leave.V} his {address,.N:s} on the {table,.N:s}

Here, the expressions <!V> and <!N> would produce the expected results.
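The effect of lexical ambiguity on the global negation can be reproduced with a toy model. The category sets below are invented for illustration (they are not NooJ's dictionary), and the rule "a negated symbol matches as soon as one analysis of the form is outside the category" is an assumption derived from the explanation above; the names are hypothetical.

```python
# Toy analyses for the untagged sentence: word form -> possible categories.
AMBIGUOUS = {
    "I": {"PRO"}, "left": {"V", "A"}, "his": {"DET"},
    "address": {"N", "V"}, "on": {"PREP"}, "the": {"DET"}, "table": {"N", "V"},
}

# Same sentence after tagging: left = verb, address = noun, table = noun.
DISAMBIGUATED = dict(AMBIGUOUS, left={"V"}, address={"N"}, table={"N"})


def negation_matches(category, analyses):
    """<!CAT> matches if at least one analysis of the form is not CAT."""
    return any(c != category for c in analyses)


def locate_negation(category, lexicon):
    return [w for w, analyses in lexicon.items() if negation_matches(category, analyses)]


print(locate_negation("V", AMBIGUOUS))       # every form, including "left"
print(locate_negation("V", DISAMBIGUATED))   # "left" is no longer matched
print(locate_negation("N", DISAMBIGUATED))   # "address" and "table" are excluded
```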

5.5. PERL-type Expressions

It is also possible to apply a SED-GREP-PERL-type regular expression to constrain the word forms that match a certain pattern. To do so, use the +MP=”...” feature. For instance:

<ADV+MP=”ly$”>

matches all the adverbs (“ADV”) that end with “ly” (in the PERL pattern: “ly$”, the “$” character represents the end of the word form). In the same manner:

<UNK-MP=”^[A-Z]”>

matches all the unknown words (“UNK”) that do not start with an uppercase letter (the PERL special character “^” represents the beginning of the word form; the set [A-Z] represents any of the characters A, B, ... Z). Finally, it is possible to combine more than one PERL-type matching pattern:

<N+Hum-MP=”or$”-MP=”[Aa]”>

human nouns that do not end with “or”, and do not contain any “a” or “A”.
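The three queries above can be imitated with Python's re module, whose syntax agrees with PERL for these simple patterns. This is a sketch under stated assumptions: the word list and its categories are invented, the +Hum feature of the third query is left out for brevity, and the function name select is hypothetical.

```python
import re

# Invented sample: (word form, category).
WORDS = [
    ("slowly", "ADV"), ("however", "ADV"),
    ("Touchett", "UNK"), ("ahem", "UNK"),
    ("doctor", "N"), ("teacher", "N"), ("child", "N"),
]


def select(category, plus=(), minus=()):
    """Keep forms of the category that match every +MP and no -MP pattern."""
    selected = []
    for word, word_category in WORDS:
        if word_category != category:
            continue
        if all(re.search(p, word) for p in plus) and not any(re.search(p, word) for p in minus):
            selected.append(word)
    return selected


print(select("ADV", plus=[r"ly$"]))             # like <ADV+MP="ly$">
print(select("UNK", minus=[r"^[A-Z]"]))         # like <UNK-MP="^[A-Z]">
print(select("N", minus=[r"or$", r"[Aa]"]))     # like <N-MP="or$"-MP="[Aa]">
```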


Symbols in regular expressions represent:
-- word forms characterized by their case; e.g. <LOW> matches all lowercase word forms;
-- word forms that belong to a morphological equivalence set; e.g. <have> matches all forms of the verb “to have”;
-- word forms associated with a morpho-syntactic category; e.g. <PREP> matches all prepositions;

-- any number of the codes that are available in NooJ’s dictionary can be used in combinations; e.g. <N+Hum+Medic+p> (plural human noun of Medical vocabulary) or <V+tr+PR-3> (any transitive verb conjugated in the Present and not in the third person);

-- negations can be used either globally (with the “!” character) or for each property (with the “-” character).

5.6. Exercises

(1) Extract the passive sentences from the novel “The portrait of a lady”.

Look for all the conjugated forms of the verb “to be”, followed by a past participle and the preposition “by”, in order to find sequences such as “... were all broken by...”. Then generalize the pattern to recognize negations as well as adverbial insertions.

(2) The word form “like” is ambiguous because it is either a verb (e.g. “I like her”) or a preposition (e.g. “like a rainbow”). Build the concordance of this form in the text; from this concordance, design two regular expressions that would disambiguate the form, i.e. one regular expression to recognize only the verbal form, and one to recognize only the preposition.

Start by studying the minimal contexts in which this form is unambiguous, for instance “(I + you + we) like”, “(should + will) like”, “to like”.

(3) Extract from the text all the sentences that express future.

Extract sequences that contain “will” or “shall” followed by an infinitive verb; then extend the request to find constructs such as “I am going to eat”, “I won’t work tomorrow” and “I’ll come back in a few weeks”.

Chapter 6. The GRAMMAR EDITOR

Until now, we have used regular expressions in order to describe and retrieve simple morpho-syntactic patterns in texts. Despite their ease of use and power, regular expressions are not well suited for more ambitious linguistic projects, because they do not scale very well: as the complexity of phenomena grows, the number of embedded parentheses rises, and the expression as a whole quickly becomes unreadable. This is when we use NooJ graphs.

6.1. Create a grammar

In NooJ, grammars of all sorts are represented by organized sets of graphs.

A graph is a set of nodes, some of them being possibly connected, in which one distinguishes one initial node, and one terminal node. In order to describe sequences of letters (at the morphological level) or sequences of words (at the syntactic level), one must “spell” these sequences by following a path (i.e. a sequence of connections) that starts at the initial node of the graph, and ends at its terminal node.

Select in the menu File > New > Grammar. Select the language “en” (for English) both in the INPUT and in the OUTPUT parts of the grammar, then click the button: Create a Syntactic Grammar. A window like the following figure should be displayed; this grammar already contains two nodes: the initial node is represented by a horizontal arrow, and the terminal node is represented by a crossed circle. You can move these nodes by dragging them: move the initial node to the left of the graph, and the terminal node to the right, just like in the following figure:

Figure 10. An empty graph already contains an initial node and a terminal node

Basic operations

In order to create a node somewhere in the window, position the cursor where you want the node to be created (anywhere but on another node), and then Ctrl-Click (i.e. press one of the Ctrl keys on the keyboard, keep the key down, then click with the left button of the mouse, then release the Ctrl key).

When a node has just been created, it is selected (it should be displayed in blue, by default). Enter the text “the” (this will be the label of the node), then validate by hitting Ctrl-Enter (press the Control key, and then the Enter key at the same time).

In order to select a node, click it. In order to unselect a node, click anywhere on the window (but not on a node). Make sure you deselect the previous node by clicking anywhere in the window (but not on a node). Then, create a node ( Ctrl-Click somewhere, but not in a node), enter the label “<WF>”, then validate with Ctrl-Enter. Finally create a third node labeled with “this”.

In order to delete a node, select it (click it), then erase its label, then validate with the Ctrl-Enter key (a node with no label is useless, therefore it is deleted).

In order to connect two nodes, select the source one (click it), then the target one. In order to unconnect two nodes, perform the same operation, as if you wanted to connect them again: click the source one, then the target one.


Warning: if you double-click a node, NooJ understands that you want to connect the node to itself; therefore it creates a loop on this node. To cancel this operation, just double-click this node again: that will delete the connection.

Now it is your turn: connect the initial node to the node labeled “the”, and then to the node labeled “this”. Then select the two nodes “the” and “this”, and connect them to the node “<WF>”. Then connect the latter node to the terminal node. You should obtain a graph like the following:

Figure 11. Graph that recognizes “the” or “this”, followed by any word form

This graph recognizes all the sequences that start with “the” or “this”, followed by any word form (remember that “<WF>” stands for any word form). For instance: “this cat”, or “the house”. Make sure to save your grammar: File > Save (or Ctrl-S); give it a name such as “my first grammar”.

You might have made mistakes, such as:

-- create an extra, unwanted node; in that case, just select the unwanted node, then delete its label, then validate by hitting Ctrl-Enter (this destroys the unwanted node);

-- create extra, unwanted (loops or reverse) connections; in that case, select the source node of the unwanted connection, then select its target node (this destroys the connection).
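The idea of “spelling” a sequence by following a path from the initial node to the terminal node can be sketched in code. The representation below (a table of labels and a table of connections) is an assumption chosen for illustration and says nothing about NooJ's grammar file format; all names are hypothetical, and <WF> is simplified to “any sequence of letters”.

```python
# Labels of the three nodes created above.
LABELS = {"n_the": "the", "n_this": "this", "n_wf": "<WF>"}

# Connections of the graph: source node -> target nodes.
EDGES = {
    "initial": ["n_the", "n_this"],
    "n_the": ["n_wf"],
    "n_this": ["n_wf"],
    "n_wf": ["terminal"],
}


def label_matches(label, word):
    """<WF> matches any sequence of letters; other labels match the word itself."""
    if label == "<WF>":
        return word.isalpha()
    return word.lower() == label


def recognizes(words, node="initial"):
    """True if the word sequence spells a path from node to the terminal node."""
    for target in EDGES.get(node, []):
        if target == "terminal":
            if not words:                       # path ends exactly at the last word
                return True
        elif words and label_matches(LABELS[target], words[0]):
            if recognizes(words[1:], target):   # consume one word, follow the connection
                return True
    return False


print(recognizes(["this", "cat"]), recognizes(["the", "house"]))
print(recognizes(["a", "house"]), recognizes(["the"]))
```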

6.2. Apply a grammar to a text

As soon as a graph is saved, one can immediately apply it to any text. If you have not already done so, load the text “The portrait of a lady”, then call the Locate Panel ( Text > Locate Pattern):

Figure 12. Applying a graph to a text

This time, instead of entering a regular expression, we are going to apply a grammar. (A) Select the option a NooJ grammar; a dialog box is displayed, that asks you to enter a graph name; enter it (or select it from the file browser), then validate (hit the Enter key, or click the Open button).

Finally, click one colored button at the bottom right end of the “Locate panel”. NooJ launches the search, then displays the number of matches found; click OK. You should get a concordance similar to the following figure:

Figure 13. Concordance of the grammar equivalent to the regular expression: (the+this) <WF>

6.3. Create a second grammar

Select in the menu File > New > Grammar, then select the languages English/English, then click the “syntactic grammar” button. A new empty grammar is displayed (that already contains the initial and terminal nodes). Create three nodes with the following labels:

there    is      no
this     are     none
         was     not
         were    nothing

To write a disjunction (e.g. there or this), press the Enter key, so that each term appears on one line. Finally, connect the nodes as in the following figure. Save the graph ( File > Save) with a name such as “my second grammar”.

Figure 14. Another grammar

Open the Locate panel ( Text > Locate), select the option a NooJ grammar, select your grammar file name, then click a colored button. NooJ applies the grammar to the text, and then displays the concordance.

Note that, because linguistic resources are available for this text, one could have entered the symbol “<be+3>” (any form of “to be” conjugated in the third person), instead of the expression “is+are+was+were”.

6.4. Describe a simple linguistic phenomenon

Grammars are used to extract sequences of interest in texts, but also to describe various linguistic phenomena. For instance, the following French graph describes what sequences of clitics can occur between the preverbal pronoun il (= he) and the following verb.

Figure 15. A local grammar for preverbal particles

This graph recognizes the following valid French sequences:

Il dort (he sleeps),
Il le lui donne (he gives it to him),
Il leur parle (he talks to them),
Il me la prend (he takes her from me)

At the same time though, the following incorrect sequences would not be recognized by the grammar:

*Il lui le donne, *Il lui leur parle, *Il la me prend

Exercise: build this grammar, then generalize it by adding an optional negation (e.g. “il ne lui donne (pas)”), the elided pronouns m’, t’, s’, and the two pronouns en and y, in order to recognize all preverbal sequences, including the following ones:

Il t’en donne, Il m’y verra, Il ne m’y verra (pas)

Then, load the text “La femme de trente ans”, and apply the grammar to the text to study its coverage.

Topic revision: r3 - 2009-07-05 - MaxSilberztein
 