HFST: Date Parser

We exemplify the use of the HFST command line tools with an example taken from Beesley & Karttunen: a transducer that recognizes English date expressions from "Monday, January 1, 1" to "Sunday, December 31, 9999". The shell variable $FORMAT holds the implementation type of the transducers. The solution given on this page can also be executed with a single script.

We deliberately use the tool HfstRegexp2Fst sparingly and instead give examples of how the other HFST command line tools can be used to achieve the same results.

Note that there is a small error in Beesley & Karttunen's solution: define Month30 [{April} | {June} | {September} | {December}]; should be define Month30 [{April} | {June} | {September} | {November}];. This error is fixed in this solution.
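All commands on this page expect $FORMAT to be set in the shell. A minimal sketch, assuming your HFST installation supports the format name used here (the default value is an assumption, not something this page prescribes):

```shell
# Assumption: "openfst-tropical" is a format supported by your HFST build;
# "sfst" and "foma" are other commonly available choices.
FORMAT="${FORMAT:-openfst-tropical}"
export FORMAT
echo "Transducer format: $FORMAT"
```

If FORMAT is already set in the environment, the existing value is kept.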

Numbers from one to nine.

echo "[1|2|3|4|5|6|7|8|9]" | hfst-regexp2fst -f $FORMAT > OneToNine.hfst

Numbers from zero to nine.

echo "0" | hfst-strings2fst -f $FORMAT | hfst-disjunct OneToNine.hfst > ZeroToNine.hfst

Even numbers. In the regular expression syntax a bare 0 denotes epsilon, so the digit zero is written %0.

echo "[%0|2|4|6|8]" | hfst-regexp2fst -f $FORMAT -j > Even.hfst

Odd numbers.

echo "[1|3|5|7|9]" | hfst-regexp2fst -f $FORMAT -j > Odd.hfst

Even and odd numbers.

hfst-disjunct Even.hfst Odd.hfst > N.hfst

Days of the week.

echo "Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday" | hfst-strings2fst -f $FORMAT -j > Day.hfst

A special month that usually has only 28 days.

echo "February" | hfst-strings2fst -f $FORMAT > Month28.hfst

Months that have 30 days.

echo "April
June
September
November" | hfst-strings2fst -f $FORMAT -j > Month30.hfst

Months that have 31 days.

echo "January
March
May
July
August
October
December" | hfst-strings2fst -f $FORMAT -j > Month31.hfst

All months.

hfst-disjunct Month28.hfst Month30.hfst | hfst-disjunct Month31.hfst > Month.hfst

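Numbers from 1 to 31. The digits are written as separate symbols, because 30 and 31 written without a space would be read as single multicharacter symbols, and zero is written %0 because a bare 0 denotes epsilon.

echo ' [ @"OneToNine.hfst" | [1|2] @"ZeroToNine.hfst" | 3 [%0|1] ] ' | hfst-regexp2fst -f $FORMAT > Date.hfst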

Numbers from 1 to 9999

hfst-repeat -t 3 ZeroToNine.hfst | hfst-concatenate -1 OneToNine.hfst > Year.hfst
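The intended languages of Date.hfst and Year.hfst are small enough to cross-check without HFST. The sketch below writes them as POSIX extended regular expressions and counts the accepted candidates with grep. The variable names are illustrative and are not part of the HFST solution.

```shell
# Plain-regex equivalents of the intended Date and Year languages
# (illustrative names; these are not HFST files).
DATE_RE='([1-9]|[12][0-9]|3[01])'
YEAR_RE='[1-9][0-9]{0,3}'

# -E: extended regex, -x: match the whole line, -c: count matching lines.
date_count=$(seq 0 40 | grep -Exc "$DATE_RE")
year_count=$(seq 0 10000 | grep -Exc "$YEAR_RE")

echo "dates accepted: $date_count"
echo "years accepted: $year_count"
```

The counts should be 31 and 9999: 0, 32..40 and 10000 are rejected.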

At this point, we can test if the transducers are correct by printing for each transducer, say, 5 random strings recognized by that transducer:

for i in *.hfst; do echo $i:; hfst-fst2strings -r 5 $i; echo "" ; done

Day or [Month and Date] with optional Day and Year, excluding leap dates.

echo ", " | hfst-strings2fst -f $FORMAT > CommaSpace.hfst;
echo " " | hfst-strings2fst -f $FORMAT > Space.hfst;
echo "" | hfst-strings2fst -f $FORMAT > Epsilon.hfst;

An optional day followed by a comma and a space, e.g. "Thursday, ".

hfst-concatenate Day.hfst CommaSpace.hfst | hfst-disjunct Epsilon.hfst > OptionalDay.hfst

Month and a date separated by a space, e.g. "January 14".

hfst-concatenate Month.hfst Space.hfst | hfst-concatenate -2 Date.hfst > MonthDate_.hfst

Constraints on dates 29, 30 and 31: a month with 30 days cannot be followed by 31, and February cannot be followed by 29, 30 or 31.

for i in 29 30 31
do
  echo $i | hfst-strings2fst -f $FORMAT > $i.hfst;
done

hfst-concatenate Month30.hfst Space.hfst | hfst-concatenate 31.hfst > Constraint30.hfst;
hfst-disjunct 30.hfst 31.hfst | hfst-disjunct 29.hfst > TMP.hfst;
hfst-concatenate Month28.hfst Space.hfst | hfst-concatenate TMP.hfst > Constraint28.hfst;

hfst-subtract MonthDate_.hfst Constraint30.hfst | hfst-subtract -2 Constraint28.hfst > MonthDate.hfst;
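The subtraction above is plain set difference, which can be mimicked with grep as a quick sanity check: first keep every "Month Date" string, then remove the lines matched by each constraint. This is only a sketch of the logic with illustrative names; it does not use or replace the HFST files.

```shell
# Illustrative regex counterparts of Month, Date and the two constraints.
MONTH_RE='(January|February|March|April|May|June|July|August|September|October|November|December)'
DATE_RE='([1-9]|[12][0-9]|3[01])'
C30_RE='(April|June|September|November) 31'
C28_RE='February (29|30|31)'

# Keep well-formed "Month Date" lines, then subtract both constraint sets.
month_date() {
  grep -Ex "$MONTH_RE $DATE_RE" | grep -Evx "$C30_RE" | grep -Evx "$C28_RE"
}

accepted=$(printf '%s\n' 'June 30' 'June 31' 'February 28' 'February 29' 'July 31' | month_date)
echo "$accepted"
```

Of the five candidates, only "June 30", "February 28" and "July 31" survive both filters.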

An optional year, e.g. ", 1995".

hfst-concatenate CommaSpace.hfst Year.hfst | hfst-disjunct Epsilon.hfst > OptionalYear.hfst

Get all valid dates, except leap dates.

hfst-concatenate OptionalDay.hfst MonthDate.hfst | hfst-concatenate -2 OptionalYear.hfst | hfst-disjunct Day.hfst > ValidDates.hfst

Get numbers divisible by 4. Of the single-digit numbers, 4 and 8 are divisible by 4. In longer numbers divisible by 4, the last digit is 0, 4 or 8 if the penultimate digit is even, and 2 or 6 if the penultimate digit is odd. This time we resort to the SFST programming language parser.

echo "4 | 8 | (0|1|2|3|4|5|6|7|8|9)* ( (0|2|4|6|8)(0|4|8) | (1|3|5|7|9)(2|6) )" | hfst-sfstpl2fst -f $FORMAT > Div4.hfst
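The digit rule can be checked against ordinary arithmetic without HFST. The sketch below restates the expression as a POSIX extended regular expression (DIV4_RE is an illustrative name) and compares the numbers it accepts in 1..9999 with the multiples of 4 produced by seq.

```shell
# Same digit rule as the SFST-PL expression, as an extended regex.
DIV4_RE='4|8|[0-9]*([02468][048]|[13579][26])'

# Numbers in 1..9999 accepted by the digit rule.
by_regex=$(seq 1 9999 | grep -Ex "$DIV4_RE")
# Multiples of 4 in the same range, computed arithmetically.
by_arith=$(seq 4 4 9999)

if [ "$by_regex" = "$by_arith" ]; then
  echo "digit rule agrees with arithmetic on 1..9999"
else
  echo "digit rule and arithmetic disagree"
fi
```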

Leap years are divisible by 4 but we have to subtract centuries that are not divisible by 400. Centuries that are not divisible by 400 are of format "a number that is not divisible by 4 followed by two zeros", e.g. 1500 or 2100.

echo "00" | hfst-strings2fst -f $FORMAT > 00.hfst;
hfst-repeat -f 1 N.hfst |  # all integers
hfst-subtract -2 Div4.hfst |  # all integers not divisible by 4
hfst-concatenate -2 00.hfst |  # all centuries not divisible by 400
hfst-subtract -1 Div4.hfst |  # all leap years
hfst-conjunct Year.hfst > LeapYear.hfst  # keep only leap years from 1 to 9999
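For comparison, the same leap year rule can be stated arithmetically. The helper below is a hypothetical function for cross-checking individual years against LeapYear.hfst; it is not part of the HFST solution.

```shell
# Hypothetical helper: succeeds if the year in $1 is a leap year,
# i.e. divisible by 4 and not a century unless divisible by 400.
is_leap_year() {
  [ $(($1 % 4)) -eq 0 ] && { [ $(($1 % 100)) -ne 0 ] || [ $(($1 % 400)) -eq 0 ]; }
}

for y in 1700 1708 1900 1916 2000; do
  if is_leap_year "$y"; then
    echo "$y: leap year"
  else
    echo "$y: not a leap year"
  fi
done
```

Under this rule 1708, 1916 and 2000 are leap years, while 1700 and 1900 are not.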

An optional leap year, e.g. ", 1916".

hfst-concatenate CommaSpace.hfst LeapYear.hfst | 
hfst-disjunct Epsilon.hfst > OptionalLeapYear.hfst

Construct leap dates.

hfst-concatenate Month28.hfst Space.hfst | hfst-concatenate 29.hfst | hfst-concatenate -1 OptionalDay.hfst | hfst-concatenate -2 OptionalLeapYear.hfst > LeapDates.hfst

Get all possible dates.

hfst-disjunct ValidDates.hfst LeapDates.hfst > Dates.hfst

We can now use the files false_dates:

February 29, 1900
Monday, February 29, 1700
Wednesday, December 32, 2003
June 31

and correct_dates:

February 29, 1916
Saturday, February 29, 1708
Thursday, December 31, 2005
July 31

and the tool hfst-lookup to test if the transducer Dates.hfst accepts all correct dates and rejects all false dates:

hfst-lookup -I false_dates -i Dates.hfst;
hfst-lookup -I correct_dates -i Dates.hfst
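The two input files are not created by any of the commands on this page. One way to write them, with the contents listed above, is with here-documents, run before the hfst-lookup commands:

```shell
# Expressions that Dates.hfst is expected to reject.
cat > false_dates <<'EOF'
February 29, 1900
Monday, February 29, 1700
Wednesday, December 32, 2003
June 31
EOF

# Expressions that Dates.hfst is expected to accept.
cat > correct_dates <<'EOF'
February 29, 1916
Saturday, February 29, 1708
Thursday, December 31, 2005
July 31
EOF
```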


-- ErikAxelson - 2011-08-22