Monday 16 June 2008

Better ways to estimate semantic error rate

I've just added some code to automatically estimate semantic error rate for translation applications. It does more or less the same thing as the code we've had for a while in dialogue apps, and counts an example from a speech corpus as semantically correct if it produces the same interlingua as the transcription would have done.

Unfortunately, the problem with this definition is that it doesn't work for utterances that are in domain, but out of grammar coverage. For example, I was just looking though the results for the English MedSLT corpus. In one example, the transcription is "does the pain ache", which is out of grammar coverage. The first hypothesis which produces well-formed interlingua is "does the pain feel aching", which is a good paraphrase and is selected. So this should really be counted as semantically correct, but isn't.

I think we can address the problem by allowing the developer to declare a file of paraphrases, and say that the example is semantically correct if it gives the same result as either the actual transcription or one of its paraphrases. Then if the developer adds in-coverage paraphrases where they exist, things will work correctly. This should be easy to implement. Probably we want a warning if a paraphrase in fact is also determined to be out of coverage.

This paraphrase functionality should also be useful for the N-best rescoring work that Maria and I have been doing for dialogue apps. We have the same problem there - we want to be able to experiment with out of coverage examples, but currently get no figures.

No comments: