The new functionality improves the picture by trying to determine which parses can be kept, and only reparsing the remaining ones. The current rules for determining which new parses are required are as follows:
- After each invocation of EBL_TREEBANK, Regulus saves both the treebank and a copy of the grammar used to create it. The next time EBL_TREEBANK is called, the system compares the saved grammar and treebank with the current grammar and training corpus.
- The grammar comparison determines two things: 1) Have any non-lexical rules changed? 2) If only lexical rules have changed, which lexical items are affected?
- If non-lexical rules have changed, the whole treebank needs to be reparsed. Most often, however, this is not the case. If no rules, or only lexical rules, have changed, the treebank is incrementally updated as follows.
- Any items in the treebank which correspond to sentences no longer in the current training corpus are removed.
- Any items in the current training corpus which do not occur in the old treebank are parsed and added to the new treebank.
- Any items in the treebank which include changed lexical items are reparsed and added to the new treebank.
- All remaining items in the old treebank are kept.
I have done some testing, and things appear OK, but I know from experience that this kind of non-monotonic code often contains subtle bugs which aren't immediately apparent. Please let me know if things don't work as expected, and I will give priority to sorting out problems. If necessary, you can toggle the incremental treebanking functionality using the new commands INCREMENTAL_TREEBANKING_OFF and INCREMENTAL_TREEBANKING_ON. By default, incremental treebanking is on.
No comments:
Post a Comment