I've just checked in new code that makes it possible to create training material for N-best rescoring in speech translation applications. The functionality is basically the same as what we already had for dialogue applications, but a number of details had to be fixed. The potential for improving performance with N-best rescoring seems to vary considerably between apps. So far, we've looked at the following cases:
- Calendar: rescoring can already almost halve the error rate, and more should be possible.
- Ford app: almost no potential for improvement.
- Paideia app: considerable potential for improvement (we don't currently have figures).
- English MedSLT: maximum possible improvement looks like about 10% relative.
- French MedSLT: maximum possible improvement about 15-20% relative.
- Japanese MedSLT: almost no potential for improvement.
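
For anyone wanting to reproduce this kind of estimate, here is a rough sketch of one way to compute the "maximum possible improvement" for an app. It assumes the figure is an oracle bound: an utterance counts as recoverable if any hypothesis in its N-best list is correct, even when the top-ranked one is not. That reading, the function name `rescoring_potential`, and the input format (one list of correct/incorrect flags per utterance, in recognizer rank order) are illustrative assumptions, not taken from the checked-in code.

```python
from typing import Dict, List


def rescoring_potential(nbest_correct: List[List[bool]]) -> Dict[str, float]:
    """Estimate the headroom for N-best rescoring (hypothetical helper).

    Each inner list flags, in recognizer rank order, whether each
    hypothesis for one utterance is correct.
    """
    total = len(nbest_correct)
    if total == 0:
        raise ValueError("need at least one utterance")

    # Baseline: errors made when always taking the top-ranked hypothesis.
    baseline_errors = sum(1 for hyps in nbest_correct if not (hyps and hyps[0]))

    # Oracle: errors left if a perfect rescorer could pick any correct
    # hypothesis that appears anywhere in the N-best list.
    oracle_errors = sum(1 for hyps in nbest_correct if not any(hyps))

    # Relative improvement is measured against the baseline error count.
    if baseline_errors:
        max_relative = (baseline_errors - oracle_errors) / baseline_errors
    else:
        max_relative = 0.0

    return {
        "baseline_error_rate": baseline_errors / total,
        "oracle_error_rate": oracle_errors / total,
        "max_relative_improvement": max_relative,
    }


if __name__ == "__main__":
    # Made-up toy data: five utterances with N-best lists of varying length.
    toy = [
        [True, False],
        [False, True],
        [False, False],
        [True],
        [False, False, True],
    ]
    print(rescoring_potential(toy))
```

On the toy data, three of five utterances are wrong at rank 1 and only one has no correct hypothesis at all, so the bound is a two-thirds relative reduction. An app with "almost no potential for improvement" would be one where the baseline and oracle error rates nearly coincide.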