Thursday, 18 September 2008

Dynamic Regulus lexicon entries

Regulus now includes an interface to Nuance's dynamic grammar capabilities, which in effect makes it possible to add new lexicon entries at runtime. Dynamic lexicon entries must be defined using macros that have been declared dynamic in the Regulus source file.

I have checked in a sample application in $REGULUS/Examples/Toy1SpecialisedDynamic; there is basic documentation in doc/README.txt. The application uses a version of the Toy1Specialised grammar in which commands need to be prefaced by a name. The user can dynamically add new names to the recognition vocabulary while the application is running. The following extract from the lexicon file shows the macro and declaration for the dynamic name entries:

macro(person_name(Surface, Sem),
      @name(Surface, [Sem], [agent], sing, [])).

dynamic_lexicon( @person_name(Surface, Sem) ).

At runtime, new name entries can be added using calls to the predicate assert_dynamic_lex_entry/1. A typical call might look like this:

assert_dynamic_lex_entry( @person_name((howard, the, duck), howard_the_duck))
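
To make the constraint concrete, here is a toy model in Python of the bookkeeping described above: an entry can only be added at runtime if its macro has first been declared dynamic. This is only an illustrative sketch. The class DynamicLexicon and its methods are hypothetical names, and the code says nothing about how Regulus or Nuance actually implement dynamic lexica.

```python
# Toy model only: this is NOT how Regulus or Nuance implement dynamic lexica.
# The class and method names are hypothetical, chosen for illustration.

class DynamicLexicon:
    def __init__(self):
        self.dynamic_macros = set()  # (macro_name, arity) pairs declared dynamic
        self.entries = []            # (macro_name, args) entries added at runtime

    def declare_dynamic(self, macro_name, arity):
        """Rough analogue of a dynamic_lexicon declaration in the lexicon file."""
        self.dynamic_macros.add((macro_name, arity))

    def assert_entry(self, macro_name, *args):
        """Rough analogue of assert_dynamic_lex_entry/1: add an entry at runtime."""
        if (macro_name, len(args)) not in self.dynamic_macros:
            raise ValueError(
                f"{macro_name}/{len(args)} has not been declared dynamic"
            )
        entry = (macro_name, args)
        if entry not in self.entries:  # ignore exact duplicates
            self.entries.append(entry)
        return entry


lexicon = DynamicLexicon()
lexicon.declare_dynamic("person_name", 2)
lexicon.assert_entry("person_name", ("howard", "the", "duck"), "howard_the_duck")
```

In this model, adding an entry for a macro that was never declared (for example a hypothetical place_name) raises an error instead of silently extending the vocabulary.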

Note that the infrastructure needed to run dynamic applications is somewhat different from the standard one. In particular, it is necessary to use a Resource Manager and a Compilation Server, and compile a dummy "just-in-time" recognition package. The sample application gives examples of the scripts required. I will be checking in proper documentation soon, and will post again when I have done that.

Thursday, 11 September 2008

Incremental treebanking for grammar specialisation

I have just checked in some new code, which should make the process of creating a specialised grammar much more efficient. The most time-consuming part of the process is parsing the treebank, using the EBL_TREEBANK command, or commands like EBL_ANALYSIS which call it indirectly. Until now, the whole set of training sentences had to be parsed every time. This was wasteful, since the greater part of the parses in the existing treebank were often still valid.

The new functionality improves the picture by trying to determine which parses can be kept, and only reparsing the remaining ones. The current rules for determining which new parses are required are as follows:
  • After each invocation of EBL_TREEBANK, Regulus saves both the treebank and a copy of the grammar used to create it. The next time EBL_TREEBANK is called, the system compares the saved grammar and treebank with the current grammar and training corpus.
  • The grammar comparison determines two things: 1) Have any non-lexical rules changed? 2) If only lexical rules have changed, which lexical items are affected?
  • If non-lexical rules have changed, the whole treebank needs to be reparsed. Most often, however, this is not the case. If no rules, or only lexical rules, have changed, the treebank is incrementally updated as follows.
  • Any items in the treebank which correspond to sentences no longer in the current training corpus are removed.
  • Any items in the current training corpus which do not occur in the old treebank are parsed and added to the new treebank.
  • Any items in the treebank which include changed lexical items are reparsed and added to the new treebank.
  • All remaining items in the old treebank are kept.
You need to update Regulus to get the new functionality. Note that you will see no speed-up the first time you run EBL_TREEBANK after the update: the copy of the grammar is saved only after EBL_TREEBANK has been invoked, so on that first run there is no saved grammar to compare against and the whole treebank is parsed as before. You will only notice a difference the second time you run EBL_TREEBANK.
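
The update rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Regulus code: the function update_treebank, its parameters and the data layout are all hypothetical. As a simplifying assumption, a sentence counts as "including a changed lexical item" if it contains one of the changed words, whereas the real check looks at the items in the treebank.

```python
# Sketch only: names and data layout are hypothetical, not Regulus internals.

def update_treebank(old_treebank, corpus, nonlexical_changed, changed_words, parse):
    """Return (new_treebank, reparsed_sentences).

    old_treebank       -- dict mapping sentence -> parse, or None if nothing was saved
    corpus             -- list of current training sentences
    nonlexical_changed -- True if any non-lexical rule differs from the saved grammar
    changed_words      -- set of words whose lexical entries have changed
    parse              -- function mapping a sentence to its parse
    """
    if old_treebank is None or nonlexical_changed:
        # First run, or a non-lexical rule changed: reparse the whole corpus.
        return {s: parse(s) for s in corpus}, list(corpus)

    new_treebank, reparsed = {}, []
    for sentence in corpus:
        # Sentences no longer in the corpus are dropped simply by not copying them.
        is_new = sentence not in old_treebank
        # Simplifying assumption: affected means "contains a changed word".
        is_affected = bool(changed_words & set(sentence.split()))
        if is_new or is_affected:
            new_treebank[sentence] = parse(sentence)
            reparsed.append(sentence)
        else:
            new_treebank[sentence] = old_treebank[sentence]  # keep the old parse
    return new_treebank, reparsed


def toy_parse(sentence):
    """Stand-in for the real parser."""
    return "parsed:" + sentence


old = {
    "switch on the light": "tree1",
    "dim the fan": "tree2",
    "switch off the fan": "tree3",
}
corpus = ["switch on the light", "dim the fan", "dim the light"]

new, reparsed = update_treebank(old, corpus, False, {"fan"}, toy_parse)
first_run, first_reparsed = update_treebank(None, corpus, False, set(), toy_parse)
```

In the example, only "dim the fan" (which contains a changed word) and "dim the light" (which is new) are reparsed. "switch on the light" keeps its old parse and "switch off the fan" is dropped. With no saved treebank, as on the first run after updating, every sentence is parsed.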

I have done some testing, and things appear OK, but I know from experience that this kind of non-monotonic code often contains subtle bugs which aren't immediately apparent. Please let me know if things don't work as expected, and I will give priority to sorting out problems. If necessary, you can toggle the incremental treebanking functionality using the new commands INCREMENTAL_TREEBANKING_OFF and INCREMENTAL_TREEBANKING_ON. By default, incremental treebanking is on.