Tuesday, 28 October 2008

Improvement to dynamic lexicon functionality

I have just checked in some improvements to dynamic lexicons, which should considerably reduce the number of external files created at runtime. Hopefully this will improve recognition response times, but so far I don't have a non-trivial dynamic lexicon application to test on, so I would appreciate feedback from the Ford project. In particular, please let me know at once if anything appears to be broken. If necessary, you can back out the change by reverting the file $REGULUS/Prolog/dynamic_lexicon.pl to the previous version.

Saturday, 25 October 2008

Default parse preferences for specialised grammars

Following a discussion with Pierrette, I have added default parse preferences for specialised grammars, based on the geometric mean of the rule frequencies as observed in the training corpus. This is what we have been doing for some time in generation. To get the new functionality, you need to update Regulus and remake the specialised grammar you are using. Most of the time, you shouldn't notice anything new, except that the rule frequencies will be displayed in the parse trees, as in the following example:

>> is it a sharp pain
(Parsing with left-corner parser)

Analysis time: 0.12 seconds

Return value: [(object=[adj,sharp]), (agent=[pronoun,it]), (object=[secondary_symptom,pain]),
(null=[tense,present]), (null=[utterance_type,ynq]), (null=[verb,be]),
(null=[voice,active])]

Global value: []

Syn features: []

Parse tree:

.MAIN (freq 836) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:2629-3470]
top (freq 830) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:3471-4306]
utterance (freq 622) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:4307-4934]
s (freq 31) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:11277-11313]
/ vbar (freq 461) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:5947-6413]
| / v lex(is) (freq 39) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:10819-10863]
| | np (freq 314) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:7213-7532]
| \ pronoun lex(it) (freq 53) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:10336-10394]
| tmp_cat_12 (freq 31) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:11314-11317]
| / np (freq 1153) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:1622-2628]
| | / np (freq 63) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:9998-10066]
| | | / d lex(a) (freq 86) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:9110-9201]
| | | | tmp_cat_6 (freq 63) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:10067-10070]
| | | | / adj lex(sharp) (freq 6) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:14728-14739]
| | | \ \ n lex(pain) (freq 389) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:6818-7212]
| | \ post_mods null (freq 1399) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:615-1621]
\ \ post_mods null (freq 1399) [MED_ROLE_MARKED_SPECIALISED_DEFAULT:615-1621]

------------------------------- FILES -------------------------------

MED_ROLE_MARKED_SPECIALISED_DEFAULT: c:/cygwin/home/speech/speechtranslation/medslt2/eng/generatedfiles/med_role_marked_specialised_default.regulus

Preference information:

1.80 Rule frequency score
Total preference score: 1.80

The bad news: I was hoping this would solve an annoying problem in Eng/Spa bidirectional. Unfortunately, it doesn't seem to do that. No idea why this used to work, in fact!
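
For anyone who wants to check the numbers by hand, here is a minimal Python sketch of the geometric mean over the rule frequencies shown in the tree above. It is not the Regulus code (which is Prolog); the function name geometric_mean is hypothetical, and the scaling that turns the mean into the displayed "Rule frequency score" is not covered here, so the sketch stops at the plain mean.

```python
import math

def geometric_mean(frequencies):
    """Geometric mean, computed via logs to avoid overflow on long products."""
    if not frequencies:
        raise ValueError("need at least one frequency")
    return math.exp(sum(math.log(f) for f in frequencies) / len(frequencies))

# Frequencies read off the parse tree for "is it a sharp pain", top to bottom.
tree_freqs = [836, 830, 622, 31, 461, 39, 314, 53, 31,
              1153, 63, 86, 63, 6, 389, 1399, 1399]

mean_freq = geometric_mean(tree_freqs)
print(f"{mean_freq:.1f}")
```

For this tree the plain mean comes out at roughly 180. A single rare rule (here adj lex(sharp), freq 6) pulls a geometric mean down much more than it would an arithmetic one, which is presumably the point of using it for preferences.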

Sunday, 12 October 2008

Parsing non-top constituents (continued)

I have now checked in an improved version of the functionality for parsing non-top constituents, which hides the dummy rules and shows the features for the constituent. Here are a couple of examples from Toy1Specialised:

>> np the light in the kitchen
(Parsing with left-corner parser)

Analysis time: 0.55 seconds

Return value: [[device,light],[location,kitchen],[prep,in_loc],[spec,the_sing]]

Global value: []

Syn features: [agr=3/\sing,case=A,conj=n,def=y,gapsin=B,gapsout=B,n_appositive_mod_type=none,
n_of_mod_type=none,nform=normal,pronoun=n,sem_n_type=dimmable\/switchable,
syn_type=np_with_noun,takes_about_pp=n,takes_attrib_pp=n,takes_cost_pp=n,
takes_date_pp=n,takes_duration_pp=n,takes_frequency_pp=n,takes_from_pp=n,
takes_loc_pp=n,takes_partitive=n,takes_passive_by_pp=none,takes_post_mods=n,
takes_side_pp=n,takes_time_pp=n,takes_to_pp=n,takes_with_pp=n,wh=n]

Parse tree:

np [GENERAL_ENG:2026-2044]
/ np [GENERAL_ENG:1864-1874]
| / d lex(the) [GEN_ENG_LEX:341-344]
| | nbar [GENERAL_ENG:2071-2083]
| \ n lex(light) [TOY1_LEX:44-47]
| post_mods [GENERAL_ENG:1591-1680]
| / pp [GENERAL_ENG:1747-1765]
| | / p lex(in) [TOY1_LEX:51-58]
| | | np [GENERAL_ENG:2026-2044]
| | | / np [GENERAL_ENG:1864-1874]
| | | | / d lex(the) [GEN_ENG_LEX:341-344]
| | | | | nbar [GENERAL_ENG:2071-2083]
| | | | \ n lex(kitchen) [TOY1_LEX:38-39]
| | \ \ post_mods null [GENERAL_ENG:1410-1416]
\ \ post_mods null [GENERAL_ENG:1410-1416]

------------------------------- FILES -------------------------------

GENERAL_ENG: c:/cygwin/home/speech/regulus/grammar/general_eng.regulus
GEN_ENG_LEX: c:/cygwin/home/speech/regulus/grammar/gen_eng_lex.regulus
TOY1_LEX: c:/cygwin/home/speech/regulus/examples/toy1specialised/regulus/toy1_lex.regulus

>> n light
(Parsing with left-corner parser)

Analysis time: 0.02 seconds

Return value: [[device,light]]

Global value: []

Syn features: [agr=3/\sing,conj=n,n_appositive_mod_type=none,n_of_mod_type=none,
n_post_mod_type=none,n_pre_mod_type=loc,sem_n_type=dimmable\/switchable,
takes_about_pp=n,takes_attrib_pp=n,takes_cost_pp=n,takes_date_pp=n,takes_det_type=def,
takes_duration_pp=n,takes_frequency_pp=n,takes_from_pp=n,takes_loc_pp=y,
takes_partitive=n,takes_passive_by_pp=none,takes_side_pp=n,takes_time_pp=n,
takes_to_pp=n,takes_with_pp=n]

Parse tree:

n lex(light) [TOY1_LEX:44-47]

------------------------------- FILES -------------------------------

TOY1_LEX: c:/cygwin/home/speech/regulus/examples/toy1specialised/regulus/toy1_lex.regulus
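
If you want to jump from a node in the tree to the rule that licensed it, the output above can be post-processed along the following lines. This is only a sketch in Python: it assumes that an annotation like [TOY1_LEX:44-47] is a tag from the FILES table followed by a line range in that file, and the names parse_files_table and resolve_node are hypothetical, not part of Regulus.

```python
import re

# Assumed shape of a node annotation: [FILE_TAG:first-last], read here as a
# tag from the FILES table plus a line range in that file.
NODE_REF = re.compile(r"\[([A-Z0-9_]+):(\d+)-(\d+)\]")

def parse_files_table(lines):
    """Map each file tag to its path, splitting on the first ': ' only."""
    table = {}
    for line in lines:
        tag, sep, path = line.partition(": ")
        if sep:
            table[tag.strip()] = path.strip()
    return table

def resolve_node(tree_line, files):
    """Return (path, first_line, last_line) for a tree line, or None."""
    match = NODE_REF.search(tree_line)
    if match is None:
        return None
    tag, first, last = match.groups()
    return files[tag], int(first), int(last)

files = parse_files_table([
    "TOY1_LEX: c:/cygwin/home/speech/regulus/examples/toy1specialised/regulus/toy1_lex.regulus",
])
print(resolve_node("n lex(light) [TOY1_LEX:44-47]", files))
```

Splitting on ": " rather than ":" keeps the drive letter in paths like c:/cygwin/... intact.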

Thursday, 9 October 2008

Parsing non-top constituents

Following a conversation with Pierrette last week, I realised that there was an easy way to fix things so that we can parse non-top constituents in the LC (normal) parser, as well as the DCG one. I have just checked in a first version of the new functionality. Now, when you load a grammar using the LOAD command, an extra file of dummy rules is created and added to the ones explicitly specified. There is one dummy rule for each category Cat in the grammar, of the form

dummy_top:[sem=Sem] --> Cat, Cat:[sem=Sem]

For example, the dummy rule for 'np' is

dummy_top:[sem=Sem] --> np, np:[sem=Sem]

What this means is that you can now parse NPs at top-level by simply prefacing them with the word 'np'. Thus for instance in Calendar we can do things like the following:

>> np the last meeting in geneva
(Parsing with left-corner parser)

Analysis time: 0.97 seconds

Return value: [[at_loc,[[spec,name],[head,geneva]]],[head,meeting],[spec,the_last]]

Global value: []

Syn features: []

Parse tree:

.MAIN [CALENDAR_DUMMY_TOP_LEVEL_RULES:1-1]
dummy_top [CALENDAR_DUMMY_TOP_LEVEL_RULES:22-22]
/ lex(np)
| np [GENERAL_ENG:2026-2044]
| / np [GENERAL_ENG:1849-1863]
| | / d lex(the) lex(last) [GEN_ENG_LEX:355-355]
| | | nbar [GENERAL_ENG:2071-2083]
| | \ n lex(meeting) [CALENDAR_LEX:88-89]
| | post_mods [GENERAL_ENG:1591-1680]
| | / pp [GENERAL_ENG:1747-1765]
| | | / p lex(in) [CALENDAR_LEX:151-151]
| | | | np [GENERAL_ENG:1955-1963]
| | | \ name lex(geneva) [GENERATED_NAMES:41-41]
\ \ \ post_mods null [GENERAL_ENG:1410-1416]

------------------------------- FILES -------------------------------

CALENDAR_DUMMY_TOP_LEVEL_RULES: c:/cygwin/home/speech/regulus/examples/calendar/generated/calendar_dummy_top_level_rules.regulus
CALENDAR_LEX: c:/cygwin/home/speech/regulus/examples/calendar/regulus/calendar_lex.regulus
GENERAL_ENG: c:/cygwin/home/speech/regulus/grammar/general_eng.regulus
GENERATED_NAMES: c:/cygwin/home/speech/regulus/examples/calendar.regulus
GEN_ENG_LEX: c:/cygwin/home/speech/regulus/grammar/gen_eng_lex.regulus

Semantic triples: []

No preferences apply

I should be able to improve this a little, in particular by adding some functionality to display the features on the non-top constituent as well as the semantics, but hopefully the existing version will already be quite useful.
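
The dummy-rule scheme described above can be sketched in a few lines. The real code lives in the Prolog side of Regulus; the Python below is only an illustration of the idea, and the names dummy_rule and dummy_rules are hypothetical.

```python
def dummy_rule(category):
    """Build the dummy top-level rule text for one grammar category."""
    return f"dummy_top:[sem=Sem] --> {category}, {category}:[sem=Sem]"

def dummy_rules(categories):
    """One rule per distinct category, in sorted order for stable output."""
    return [dummy_rule(cat) for cat in sorted(set(categories))]

# Example with three categories; a real grammar would supply its full list.
for rule in dummy_rules(["np", "n", "pp"]):
    print(rule)
```

In each rule the first occurrence of the category name is the literal word typed at the prompt, and the second is the constituent actually being parsed, whose semantics are passed up unchanged.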