Thursday 11 September 2008

Incremental treebanking for grammar specialisation

I have just checked in some new code which should make the process of creating a specialised grammar much more efficient. The most time-consuming part of the process is parsing the treebank, using the EBL_TREEBANK command or commands like EBL_ANALYSIS which call it indirectly. Until now, the whole set of training sentences had to be parsed every time. This was wasteful, since most of the parses in the existing treebank were usually still valid.

The new functionality improves the picture by trying to determine which parses can be kept, and only reparsing the remaining ones. The current rules for determining which new parses are required are as follows:
  • After each invocation of EBL_TREEBANK, Regulus saves both the treebank and a copy of the grammar used to create it. The next time EBL_TREEBANK is called, the system compares the saved grammar and treebank with the current grammar and training corpus.
  • The grammar comparison determines two things: 1) Have any non-lexical rules changed? 2) If only lexical rules have changed, which lexical items are affected?
  • If non-lexical rules have changed, the whole treebank needs to be reparsed. Most often, however, this is not the case. If no rules, or only lexical rules, have changed, the treebank is incrementally updated as follows.
  • Any items in the treebank which correspond to sentences no longer in the current training corpus are removed.
  • Any items in the current training corpus which do not occur in the old treebank are parsed and added to the new treebank.
  • Any items in the treebank which include changed lexical items are reparsed and added to the new treebank.
  • All remaining items in the old treebank are kept.
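
The rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Regulus implementation: the function names, the dictionary layout of the grammar and treebank, and the `parse` callback are all hypothetical, and the test for "includes a changed lexical item" is simplified to a whitespace-token match on the sentence.

```python
def changed_lexical_items(old_lexicon, new_lexicon):
    """Return the words whose lexical entries differ between two lexicons."""
    words = set(old_lexicon) | set(new_lexicon)
    return {w for w in words if old_lexicon.get(w) != new_lexicon.get(w)}


def update_treebank(old_treebank, old_grammar, new_grammar, corpus, parse):
    """Return (new_treebank, reparsed_sentences).

    Assumed (hypothetical) data layout:
      old_treebank : dict mapping sentence -> stored parse
      *_grammar    : dict with "rules" (set of non-lexical rules) and
                     "lexicon" (dict mapping word -> set of entries);
                     old_grammar is None if no grammar was saved
      corpus       : list of training sentences
      parse        : callable parsing one sentence with the new grammar
    """
    sentences = list(dict.fromkeys(corpus))  # drop duplicates, keep order

    if old_grammar is None or old_grammar["rules"] != new_grammar["rules"]:
        # No saved grammar, or non-lexical rules changed: reparse everything.
        kept, to_parse = {}, sentences
    else:
        changed = changed_lexical_items(
            old_grammar["lexicon"], new_grammar["lexicon"]
        )
        kept, to_parse = {}, []
        for sentence in sentences:
            is_new = sentence not in old_treebank
            # Simplification: match changed words against sentence tokens.
            uses_changed_word = any(w in changed for w in sentence.split())
            if is_new or uses_changed_word:
                to_parse.append(sentence)
            else:
                kept[sentence] = old_treebank[sentence]

    # Old items absent from the corpus are dropped implicitly, because only
    # sentences in the current corpus are ever copied or parsed.
    new_treebank = dict(kept)
    for sentence in to_parse:
        new_treebank[sentence] = parse(sentence)
    return new_treebank, to_parse
```

In this sketch the second return value lists the sentences that were actually reparsed, which makes it easy to see how much work an incremental update saved compared with a full reparse.
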
You need to update Regulus to get the new functionality. Note that you will see no speed-up the first time you do EBL_TREEBANK after the update: the copy of the grammar is only saved after EBL_TREEBANK has been invoked, so initially there is no saved grammar to compare against and the whole treebank is parsed as before. You will only notice a difference the second time you do EBL_TREEBANK.

I have done some testing, and things appear OK, but I know from experience that this kind of non-monotonic code often contains subtle bugs which aren't immediately apparent. Please let me know if things don't work as expected, and I will give priority to sorting out problems. If necessary, you can toggle the incremental treebanking functionality using the new commands INCREMENTAL_TREEBANKING_OFF and INCREMENTAL_TREEBANKING_ON. By default, incremental treebanking is on.
