Wednesday 13 January 2010

A Scandinavian/English grammar, continued

More progress on the Scandinavian/English grammar:
  • I've now got quite reasonable initial rules for negation and lexically passivized verbs, two of the largest holes in the Swedish coverage. Negation, in Swedish, is just an adverb. I did, however, have to add a feature that distinguishes main clauses from subordinate clauses, since the negation adverb occurs after the verb in a main clause, and before it in a subordinate clause. It turned out to be easy to adapt the existing rules for passives to handle lexical passives as well: the grammar now allows passive versions of the present, imperfect, supine and infinitive forms. Thus, for example, den kan inte köpas här, "it can not buy-INF-PASSIVE here" = "it cannot be bought here".
  • I've improved the coverage in the Swedish version of MedSLT. I can now translate 92% of the combined interlingua corpus into Swedish, and translate back 99% of the results. This mostly involved adding new lexical items and transfer rules, though I also had to make a couple of minor adjustments to the grammar.
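
As a rough illustration of the negation word-order rule from the first bullet, here is a small Python sketch. It is not the grammar formalism actually used; the function name place_negation and the clause_type values are made up for the example, and it only models the position of "inte" relative to the finite verb.

```python
def place_negation(subject, finite_verb, rest, clause_type):
    """Return a negated Swedish clause as a string.

    clause_type is a hypothetical stand-in for the main/subordinate
    feature: "main" puts the negation adverb after the finite verb,
    "subordinate" puts it before the finite verb.
    """
    if clause_type == "main":
        words = [subject, finite_verb, "inte"] + list(rest)
    elif clause_type == "subordinate":
        words = [subject, "inte", finite_verb] + list(rest)
    else:
        raise ValueError(f"unknown clause type: {clause_type!r}")
    return " ".join(words)


if __name__ == "__main__":
    # Main clause: negation follows the finite verb.
    print(place_negation("den", "kan", ["köpas", "här"], "main"))
    # Subordinate clause (e.g. after "att"): negation precedes it.
    print("att " + place_negation("den", "kan", ["köpas", "här"], "subordinate"))
```

In a real unification grammar the same effect would come from a feature on the clause that the adverb rule checks, rather than from an if/else.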

Friday 8 January 2010

Scandinavian/English grammar now used for English MedSLT

I've now changed the config files for English MedSLT so that they use the shared Scandinavian/English grammar rather than the old English-only grammar. I found a couple of small bugs, but now everything seems to be working fine again.

Some time soon, Nikos and I should get together and figure out how to add Swedish to the MedSLT demo and the nightly build. It shouldn't be at all hard.

Saturday 2 January 2010

Swedish MedSLT, continued

I temporarily broke off working on Interlingua to Swedish, and spent a day concentrating on the opposite direction. I built the recognition grammar by training on a corpus which was the union of the original recognition corpus and the generation corpus (this ensures that everything you can generate will also get recognized); then I did PCFG tuning using the set of translations produced from the combined Interlingua corpus. I also used the set of translations as the initial Swedish corpus for translation testing. All the corpora concerned are created on-the-fly as part of the make process, so the correspondences will stay up to date.
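
The corpus union step can be sketched roughly as follows. This is only an illustration in Python, assuming each corpus is a list of sentence strings; the real corpora are built by the make process, and the function name union_corpus and the sample sentences are hypothetical.

```python
def union_corpus(*corpora):
    """Merge several corpora into one, dropping duplicates and blank lines.

    Order of first occurrence is kept, so the result is deterministic
    when the build is rerun.
    """
    seen = set()
    merged = []
    for corpus in corpora:
        for sentence in corpus:
            key = sentence.strip()
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real corpora.
    recognition = ["har du ont i huvudet", "gör det ont"]
    generation = ["gör det ont", "har du feber"]
    training = union_corpus(recognition, generation)
    # Every generated sentence is in the training set by construction.
    assert all(s in training for s in generation)
    print(training)
```

Because the training set contains the whole generation corpus, anything the system can generate is also something the recognizer has been trained on.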

It was easy to get things working in the Swe -> Int direction, and 98% of the translation corpus now produces well-formed interlingua. I compiled a Swedish recognizer, and hooked everything together to get a speech-to-speech system for Swedish -> English. Anecdotally, it's not bad.

The most urgent thing now is probably to add more Swedish coverage. There are several very common constructions that currently aren't in the specialized Swedish grammar.