Tuesday, 24 June 2008

McNemar at the word level?

I was thinking about our paper for GoTAL, and one thing that's bothered me a little is that we did all the significance testing using SER (sentence error rate) - the reason was that it's easy to run a McNemar test at the sentence level. However, we got considerably larger improvements in WER (word error rate), which is really what you would expect from SLMs.

It seems to me though that you should also be able to do McNemar at the word level. You look at each word in the transcription, and then check each of the two hypotheses you're comparing to see whether they include it. This is a little coarse-grained (you treat each sentence as a bag of words), but I'd guess it would still give interesting results. Shouldn't be at all hard to implement either. If we do an expanded version of the GoTAL paper, I'd definitely like to try this.
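To make this concrete, here is a minimal Python sketch of the word-level test (the function is mine, not anything in Regulus). Each reference word token counts as one paired trial: a hypothesis scores 1 on it if its bag of words contains that token, and the discordant counts feed the usual McNemar statistic. Note that, as formulated, it only picks up deletions - words in the transcription missing from a hypothesis.

```python
from collections import Counter

def word_level_mcnemar(refs, hyps_a, hyps_b):
    """Bag-of-words McNemar over reference word tokens.

    For each word token in each reference transcription, score each of
    the two hypotheses 1 if its bag of words contains that token, else
    0, then count the discordant pairs in the usual McNemar way.
    """
    b = c = 0  # b: A has the word, B doesn't; c: the reverse
    for ref, hyp_a, hyp_b in zip(refs, hyps_a, hyps_b):
        bag_a, bag_b = Counter(hyp_a.split()), Counter(hyp_b.split())
        seen = Counter()
        for word in ref.split():
            seen[word] += 1          # handle repeated reference words
            in_a = bag_a[word] >= seen[word]
            in_b = bag_b[word] >= seen[word]
            if in_a and not in_b:
                b += 1
            elif in_b and not in_a:
                c += 1
    # Continuity-corrected chi-square statistic, 1 degree of freedom
    chi2 = (abs(b - c) - 1) ** 2 / (b + c) if b + c else 0.0
    return b, c, chi2
```

For small b + c an exact binomial test would be safer than the chi-square approximation, but the counting scheme is the same either way.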

In fact, this idea is so obvious that either it's wrong, or someone must already have thought of it. Any idea which?

PS Jun 26. Beth Ann pointed out that the proposal as originally formulated only covers deletions, but it's trivial to extend it to handle insertions too. More seriously, she wondered whether the significance results would always be reliable, given that there may be subtle dependencies. I am really not sure about this, but one way to investigate the idea empirically would be to generate large sets of simulated recognition results using a stochastic process, and look at the distributions. For example, if you generate 10000 simulated recognition runs, then take one run and find all the other runs that come out as different from it at P < 0.01 according to the new statistic, you'd be reassured to find there were not more than about 100 of them. A lot more, and something is presumably wrong; a lot fewer would presumably just show that the test isn't very sensitive.
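The empirical check could be sketched like this (a toy null model in Python; all names and parameters are mine, and the runs are generated independently, which is exactly the no-dependencies null case - the interesting experiments would use a stochastic process with built-in dependencies):

```python
import random

CHI2_CRIT = 6.635  # chi-square critical value, 1 df, P = 0.01

def mcnemar_chi2(run_a, run_b):
    """run_a, run_b: parallel 0/1 vectors of per-word correctness."""
    b = sum(1 for x, y in zip(run_a, run_b) if x and not y)
    c = sum(1 for x, y in zip(run_a, run_b) if y and not x)
    return (abs(b - c) - 1) ** 2 / (b + c) if b + c else 0.0

def false_positive_rate(n_runs=2000, n_words=500, p_correct=0.8, seed=0):
    """Simulate independent recognition runs, then see what fraction of
    them come out as 'significantly different' from the first run.
    Under this null, the fraction should stay around 1% or below."""
    rng = random.Random(seed)
    runs = [[int(rng.random() < p_correct) for _ in range(n_words)]
            for _ in range(n_runs)]
    base = runs[0]
    hits = sum(1 for other in runs[1:]
               if mcnemar_chi2(base, other) > CHI2_CRIT)
    return hits / (n_runs - 1)
```

With the continuity correction the test is slightly conservative, so a rate a little under 1% is what you'd hope to see here; much more than that would mean something is wrong with the statistic.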

Faster parsing in Regulus using Nuance

Here's something I've been meaning to do for a while, that really should be moved up the priority stack. It should be quite easy to arrange things so that, in cases where we have compiled a grammar down to Nuance form, we use Nuance to do parsing - this ought to be much faster than the Regulus parser, and could really let us speed up corpus runs. There are at least two straightforward ways to implement it. One is to start an nl-tool process and pipe sentences into it, reading the analyses that come back. It may be even simpler to use the Regserver, now that we can connect to it from the Regulus top-level, and send an "interpret" message. More about this soon, I hope.

PS Jun 26. It was indeed very easy - I took the route of creating an nl-tool process and connecting to it with pipes. The new NUANCE_PARSER command now lets you use nl-tool as the parser, and parsing times are at least 30 times faster. Everything should now be checked in. More about this soon.
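The pipe-based pattern can be sketched in Python (this is not the Regulus implementation, which is in Prolog, and I'm not reproducing the actual nl-tool command line; a portable stand-in "parser" that upper-cases its input demonstrates the plumbing):

```python
import subprocess
import sys

def parse_with_pipe(cmd, sentences):
    """Start one external parser process, pipe sentences into it, and
    read one analysis line back per sentence.  In the real setup, cmd
    would be the nl-tool invocation."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    analyses = []
    for sentence in sentences:
        proc.stdin.write(sentence + "\n")
        proc.stdin.flush()                               # one line out...
        analyses.append(proc.stdout.readline().strip())  # ...one line back
    proc.stdin.close()
    proc.wait()
    return analyses

# Portable stand-in "parser": echoes each input line upper-cased
STAND_IN = [sys.executable, "-u", "-c",
            "import sys\nfor line in sys.stdin: print(line.strip().upper())"]
```

The strict one-line-in, one-line-out discipline is what keeps the two processes from deadlocking on full pipe buffers, and starting the process once (rather than per sentence) is where the speed-up for corpus runs comes from.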

Tuesday, 17 June 2008

"Paraphrase corpora" for estimating semantic error rates

I've implemented a first cut at the "paraphrase corpus" idea that I suggested in yesterday's post. So far it only works for speech translation, but it's rather nice to see that we can now measure the effect of N-best rescoring on semantic error rate in a way that's both much quicker and much more objective than what we were doing previously. On the whole of the Eng corpus (the only one I've tried so far), N-best rescoring reduces semantic error rate on this metric by about 4% absolute, or 8% relative.

My next task here is to extend the method to dialogue processing - this should be easy, I think. We will then be able to do dialogue N-best rescoring experiments using out-of-coverage as well as in-coverage data, which should open up several new possibilities.

Monday, 16 June 2008

Better ways to estimate semantic error rate

I've just added some code to automatically estimate semantic error rate for translation applications. It does more or less the same thing as the code we've had for a while in dialogue apps, and counts an example from a speech corpus as semantically correct if it produces the same interlingua as the transcription would have done.

Unfortunately, the problem with this definition is that it doesn't work for utterances that are in domain, but out of grammar coverage. For example, I was just looking through the results for the English MedSLT corpus. In one example, the transcription is "does the pain ache", which is out of grammar coverage. The first hypothesis which produces well-formed interlingua is "does the pain feel aching", which is a good paraphrase and is selected. So this should really be counted as semantically correct, but isn't.

I think we can address the problem by allowing the developer to declare a file of paraphrases, and say that the example is semantically correct if it gives the same result as either the actual transcription or one of its paraphrases. Then, if the developer adds in-coverage paraphrases where they exist, things will work correctly. This should be easy to implement. We probably also want a warning if a declared paraphrase itself turns out to be out of coverage.
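The proposed check could look something like this (a Python sketch; the function names and the representation of interlingua forms as strings are mine, purely for illustration):

```python
def semantically_correct(hyp_interlingua, transcription, paraphrases,
                         to_interlingua):
    """Count a recognition result as semantically correct if its
    interlingua matches that of the transcription or of any declared
    paraphrase.  to_interlingua maps a sentence to its interlingua,
    returning None when the sentence is out of coverage."""
    targets = set()
    ref = to_interlingua(transcription)
    if ref is not None:
        targets.add(ref)
    for paraphrase in paraphrases.get(transcription, []):
        interlingua = to_interlingua(paraphrase)
        if interlingua is None:
            # The warning suggested above: a paraphrase that is itself
            # out of coverage contributes nothing
            print("warning: paraphrase out of coverage: %r" % paraphrase)
        else:
            targets.add(interlingua)
    return hyp_interlingua in targets
```

For the "does the pain ache" example above, the transcription itself yields no interlingua, but the declared paraphrase "does the pain feel aching" does, so the selected hypothesis gets counted as correct.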

This paraphrase functionality should also be useful for the N-best rescoring work that Maria and I have been doing for dialogue apps. We have the same problem there - we want to be able to experiment with out of coverage examples, but currently get no figures.

Wednesday, 11 June 2008

Interlingua corpora for multiple domains

Following a discussion with Pierrette last week, I have added two more MedSLT Interlingua corpora, for the chest pain and abdominal pain domains. I've also added all the associated config files, scripts etc for the currently relevant language pairs (EngInt, JapInt, IntEng, IntFre and IntJap), so it should now be possible to do systematic interlingua-centered development for all three domains. I have only built AFF versions, since we're planning to retire the linear formalism soon.

The naming conventions are the usual ones. I hopefully managed to check everything in, but let me know if files that you expected to find are missing. Pierrette should at some point do some work tidying up IntFre and FreInt and Yukie should do the same for IntJap and JapInt. Further down the line, we should really add coverage for these domains in the missing languages.

Tuesday, 10 June 2008

AFF version of Catalan

I've added initial versions of all the files needed for the AFF version of Catalan in MedSLT. Naming conventions are the usual ones, and I was able to build all the AFF Cat resources by doing

make role_marked

in the Cat/scripts directory. There should now be config files for all 5 x 5 = 25 pairs of languages in {Ara, Cat, Eng, Fre, Jap} - this involved adding a few new pairs. I only tested Interlingua to Catalan and Catalan to Interlingua. We currently get translations for about 75% of the sentences in IntCat, and about 20% in CatInt. Hopefully it will be easy to improve these figures.

Over to Pierrette and Bruna to debug the rules. Note that I have macrotised the Cat lexicon to make the AFF version work. It should be mostly OK, but there were a few cases (in particular, WH+ PPs) where I wasn't quite sure how to do the macrotisation - people who actually know Catalan should review the entries.

Wednesday, 4 June 2008

Regulus 2.9.0 released

Nikos has just created and uploaded the new 2.9.0 release of Regulus. I tried downloading and running a couple of simple tests in text and speech mode (under SICStus 4.0.3), and Toy1 at least appears to work fine. Please mail me if you notice problems.

Here are the release notes:


A large number of new features have been added to Regulus since
version 2.8.0. Most importantly, Regulus now runs under Sicstus 4;
it is possible to use speech input directly from the top-level;
N-best processing is supported in both dialogue and translation mode;
and a new semantics for translation applications has been added.

The new features are listed below in more detail. Not all of them are
fully documented yet, but we are giving priority to adding the
necessary documentation.

- Support for Sicstus 4
- Regulus runs under Sicstus 4.
- It has been thoroughly tested under 4.0.2.
- Some testing has been done under 4.0.3, but this has not yet been carefully
verified. NOTE: under 4.0.3, it is necessary to install the patch files from
Prolog/SicstusPatches/4.0.3 (see the entry for 2 June below).
- Regulus still runs under Sicstus 3, and has been thoroughly
tested under 3.12.5.

- Top-level
- Errors are now written to stderr
- There is a version of regulus_batch with an extra argument, which returns
the list of error outputs created when running the commands.
- It is possible to compile Nuance grammars from the Regulus top-level
using the NUANCE_COMPILE command.
- It is possible to perform speech recognition directly from the top-level
- The LOAD_RECOGNITION command starts defined speech resources, including
a license manager, recserver and Regserver
- After loading resources using LOAD_RECOGNITION, the RECOGNISE command
takes live speech input and passes it to the current application.
- Wavfiles are automatically logged by RECOGNISE. The WAVFILES command
lists the most recent recorded wavfiles.
- When speech resources are loaded, text input naming a recorded wavfile
performs recognition on that wavfile, and passes the result to the
current application

- Java GUI
- The Java GUI has been greatly improved, and many bugs have been fixed.
- The GUI supports direct speech input, similar to the Prolog top-level
described above
- It is possible to run multiple copies of the GUI at the same time.

- Stepper
- Various commands can now be invoked from within the stepper.

- Support for spoken dialogue applications
- When speech resources have been loaded from the command line,
dialogue corpora can contain items of the form wavfile().
This makes it possible to test corpora containing a mixture of speech
and non-speech inputs.
- Batch processing of speech input in dialogue mode produces figures
for semantic error rate. An utterance is deemed semantically correct if
it produces the same dialogue move as the transcription would have done.
- A timeout has been added in batch dialogue processing, so that processing
gives up after 10 seconds.
- If N-best preferences are defined, preference info is printed in
dialogue mode.
- Allow dialogue server to take XML-formatted requests

- Generation
- When the declaration

regulus_config(prolog_semantics, yes).

is included, generation grammars can contain arbitrary Prolog structures.

- Translation
- There is extensive support for translation using both the original
"linear" semantics, and also the new "Almost Flat Functional" (AFF)
semantics. AFF is described in our COLING 2008 paper, which will soon be
posted on the Regulus website. Some initial documentation will be added
to RegulusDoc.htm.
- It is possible in a translation config file to define an interlingua
as either a source or a target language. There are many examples
in the MedSLT project directory.
- Batch translation produces output files for judging both in Prolog
and in CSV form. There are new commands for updating judgements from the
CSV files.
- When speech resources have been loaded from the command line,
translation corpora can contain items of the form wavfile().
- A simple version of N-best processing has been added for applications
that use interlingual translation with an interlingua grammar. In N-best mode,
the first utterance producing well-formed interlingua is selected.
- Interlingua expressions ambiguous according to the interlingua grammar
are flagged in translation mode.
- If performing batch translation from Source to Target through Interlingua,
combine available Source -> Interlingua and Interlingua -> Target
judgements into Source -> Target judgements if possible.
- Show average number of generated target language surface forms when
doing batch translation.
- Translation conditions can include elements that match an item
appearing in a specified clause.

- Grammar specialisation
- Fix bug in processing of include_lex declarations.

- Help
- When defining intelligent help for translation applications, help resources
can be built from an interlingua corpus.

- Extension to Regulus grammar formalism
- Allow =@ as synonym for = @
- Add runtime support for GSL functions strcat/2, add/2, sub/2, neg/1, mul/2, div/2

- English grammar
- Rules for dates including years have been added.

- Other
- Tool added to perform random generation from PCFG-trained GSL grammars

Monday, 2 June 2008

Problems with SICStus 4.0.3 resolved

The SICStus people were as usual very responsive, and we now seem to be OK for running under 4.0.3. However (this is IMPORTANT): you need to install a couple of patch files if you are using that version of SICStus. So, if you're using 4.0.3, do the following:
  • Update Regulus from CVS, using the -d option to get new directories.
  • Copy the files from Prolog/SicstusPatches/4.0.3 to C:/Program Files/SICStus Prolog 4.0.3/library, or wherever you have your copy of SICStus.
I will set my default version of SICStus to 4.0.3, which means I'll no doubt test it a fair amount over the next few days. I would not recommend that people switch over to 4.0.3 until I've run with it for a while and reported on how it's working.

Problems with SICStus 4.0.3

We are unfortunately still having problems with SICStus 4. Things have been more or less stable with 4.0.2, but there were a few rather ugly patches - the SICStus people said things would be better in the next version. Sad to tell, I have just downloaded 4.0.3 and tried it out, and in fact, at least as far as Regulus is concerned, it's gone backwards. Due to new incompatibilities in the operating system interface libraries, it's not currently possible to run Regulus in speech mode with 4.0.3 - there may also be other problems. I can presumably implement a workaround, but the idea of having to patch the code after every new SICStus release makes me very nervous.

For Prolog people who want the low-level details, here is part of the mail I just sent to the SICStus team:

Unless I am misunderstanding something important, SP4.0.3's version of the
system3 library is still not downward-compatible with SP3's system library, and is in fact rather
less downward-compatible than SP4.0.2's system3. The problem is now in system/1.
In SP4.0.2, system/1 is defined as follows:

system(Cmd) :-
    system_binary(Binary, DashC),
    proc_call(Binary, DashC, Cmd, exit(0)).

so it's possible to make calls like the following, running under Cygwin:

| ?- system('dir > tmp_dir.txt').
1 1 Call: system('dir > tmp_dir.txt') ?
2 2 Call: system3:environ('COMSPEC',_790) ?
2 2 Exit: system3:environ('COMSPEC','C:\\WINDOWS\\system32\\cmd.exe') ?
3 2 Call: system3:process_create('C:\\WINDOWS\\system32\\cmd.exe',['/C','dir > tmp_dir.txt'],system3:[process(_1437)]) ?
3 2 Exit: system3:process_create('C:\\WINDOWS\\system32\\cmd.exe',['/C','dir > tmp_dir.txt'],system3:[process('$process'('$ptr IEDNJP'))]) ?
4 2 Call: system3:process_wait('$process'('$ptr IEDNJP'),exit(0)) ? s
4 2 Exit: system3:process_wait('$process'('$ptr IEDNJP'),exit(0)) ?
1 1 Exit: system('dir > tmp_dir.txt') ?

Under SP4.0.3, system/1 is defined thus:

system(Cmd, Status) :-
    shell_exec(Cmd, [], exit(Status)).

and the corresponding call looks like this:

| ?- system('dir > tmp_dir.txt').
1 1 Call: system('dir > tmp_dir.txt') ?
2 2 Call: system3:system('dir > tmp_dir.txt',0) ?
3 3 Call: system3:process_create('dir > tmp_dir.txt',[],system3:[commandline(true),process(_1119)]) ?
3 3 Exit: system3:process_create('dir > tmp_dir.txt',[],system3:[commandline(true),process('$process'('$ptr ALJLOO'))]) ?
4 3 Call: system3:process_wait('$process'('$ptr ALJLOO'),exit(0)) ?
4 3 Fail: system3:process_wait('$process'('$ptr ALJLOO'),exit(0)) ?
2 2 Fail: system3:system('dir > tmp_dir.txt',0) ?
1 1 Fail: system('dir > tmp_dir.txt') ?

The problem, as far as I can see, is that process_create requires its first
argument to be a program, which it isn't here.

Unfortunately, we have people running Regulus under at least 3.12.5, 4.0.2 and 4.0.3.
Maintaining the code so that it runs under all these different versions is
becoming quite difficult - the operating system interface primitives are
absolutely essential. Advice appreciated.

Interlingua corpora

Over the last few months, we have been moving MedSLT development towards a new way of doing things, which is based on the idea of an "Interlingua corpus". We present the basic picture in our LREC 2008 paper, but that's already somewhat out of date, and doesn't give any low-level details.

We now have four interlingua corpora, representing the cross-product of {linear, AFF} x {plain, combined}. The linear/AFF distinction is concerned with the type of semantics used. "Linear" is the old MedSLT semantics; AFF semantics is explained in the paper by Pierrette, Beth Ann, Yukie and myself which has just been accepted for COLING 2008, and which will soon appear on the Geneva website.

The plain/combined distinction says what information has been incorporated in the corpus. The "plain" corpus is created by merging the results of translating FROM each source language into interlingua, so each interlingua form lists the source-language sentences that translate into it. The "combined" corpus contains all the information in the "plain" corpus, together with the results of translating TO each target language.
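The merge step that builds the "plain" corpus could be sketched like this (Python sketch; the real scripts are in the Makefile mentioned below, and the data layout here is invented purely for illustration):

```python
from collections import defaultdict

def build_plain_corpus(per_language_results):
    """Merge the results of translating FROM each source language into
    interlingua: each interlingua form ends up listing, per source
    language, the sentences that translate into it.
    per_language_results: {language: [(sentence, interlingua), ...]}"""
    corpus = defaultdict(lambda: defaultdict(list))
    for language, pairs in per_language_results.items():
        for sentence, interlingua in pairs:
            corpus[interlingua][language].append(sentence)
    return {form: dict(by_language) for form, by_language in corpus.items()}
```

The "combined" corpus would then be the same structure with the target-language generation results attached to each interlingua form as well.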

At the moment, we use the plain corpus for developing translation rules that go from Interlingua to target languages. The combined corpus is used for creating help resources.

All the scripts used to build interlingua corpora are referenced in $MED_SLT2/Interlingua/scripts/Makefile.

Sunday, 1 June 2008

Running multiple copies of the GUI

Elisabeth did a little work over the weekend, and it's now possible to run multiple copies of the GUI simultaneously - this is an important feature that people have been requesting for some time. The solution turns out to be embarrassingly simple. All we needed to do, in the end, was fix things so that it's possible for both the Java and the Prolog processes to specify from the command line which port they use to communicate with each other. As long as different {Java, Prolog} pairs use different ports, they don't interfere with each other.
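The mechanism in miniature, as a Python sketch (a trivial echo server stands in for the Prolog side; nothing here is actual Regulus or GUI code). Each pair agrees on its own port, so several pairs coexist:

```python
import socket
import threading

def run_pair(message):
    """One {server, client} pair on its own port, standing in for one
    {Prolog, Java} pair.  Binding to port 0 lets the OS assign a free
    port, so several pairs can run side by side without colliding."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    def echo_once():
        conn, _ = server.accept()
        with conn:
            conn.sendall(conn.recv(1024))  # trivial echo "application"

    t = threading.Thread(target=echo_once)
    t.start()
    client = socket.create_connection(("127.0.0.1", port))
    client.sendall(message)
    reply = client.recv(1024)
    client.close()
    t.join()
    server.close()
    return reply
```

In the real scripts the port is passed explicitly on both command lines rather than OS-assigned, but the principle is the same: distinct ports, no interference.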

I've added an example script to Regulus/Java called run_prolog_and_java2.bat - this is just like run_prolog_and_java.bat, but starts a second pair of processes, communicating over a new port.