opennlp-dev mailing list archives

From: Jörn Kottmann
Subject: Re: Should entries in the abbreviation dictionary include '.' ?
Date: Mon, 19 Mar 2012 08:42:36 GMT
On 03/16/2012 09:47 PM, wrote:
> Hi,
> Should entries in the abbreviation dictionary include '.' ?
> The one included for the unit test includes:
> If we include the EOS character not all features are collected properly.
> The most important issue is here:
>        if (inducedAbbreviations.contains(prefix)) {
>          collectFeats.add("xabbrev");
>        }
> if we include the EOS in the dictionary entries this feature will
> never be collected.
> On the other hand we also have the following:
>        if (inducedAbbreviations.contains(previous)) {
>          collectFeats.add("vabbrev");
>        }
> This would fail if the previous token is an abbreviation and the abb
> dictionary does not include EOS characters.
> I would change the code to pass the EOS character as argument to the
> collectFeatures method. What do you think?

Abbreviations can often be written with or without dots. Maybe we should
write a small utility method that removes all non-letters from a token and
then match the result against the dictionary. The same method could be run
over the dictionary entries before the dictionary is used.

What do you think?
What happens if there is a comma?
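
A minimal sketch of such a utility, to make the idea concrete. The class
and method names (AbbreviationNormalizer, normalize, normalizeAll,
isAbbreviation) are hypothetical and not part of OpenNLP, and the
dictionary is modelled here as a plain Set<String> rather than the real
dictionary class. Because every non-letter is removed, a trailing comma is
stripped the same way a trailing dot is.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class AbbreviationNormalizer {

    private AbbreviationNormalizer() {
    }

    // Removes every character that is not a Unicode letter,
    // so "e.g." becomes "eg" and "Dr," becomes "Dr".
    static String normalize(String token) {
        return token.replaceAll("[^\\p{L}]", "");
    }

    // Runs the same normalization over the dictionary entries once,
    // before the dictionary is used. Entries that contain no letters
    // at all are dropped instead of being stored as the empty string.
    static Set<String> normalizeAll(Iterable<String> entries) {
        Set<String> normalized = new HashSet<>();
        for (String entry : entries) {
            String n = normalize(entry);
            if (!n.isEmpty()) {
                normalized.add(n);
            }
        }
        return normalized;
    }

    // Matches a token against an already normalized dictionary.
    static boolean isAbbreviation(Set<String> normalizedDictionary, String token) {
        String n = normalize(token);
        return !n.isEmpty() && normalizedDictionary.contains(n);
    }
}
```

One caveat of this approach: digits are removed as well, so an entry such
as "3M" would be reduced to "M". Whether that is acceptable would have to
be checked against real data.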

Maybe we get better results when the dictionary feature is also combined
with other features, e.g. the next-initial-capital feature.
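
A rough sketch of what such a combined feature could look like. It is only
an illustration under assumptions: the class CombinedAbbrevFeature, the
method collect, and the feature names "ncap" and "xabbrev_ncap" are made up
for this example; only "xabbrev" comes from the code quoted above. The
abbreviation dictionary is again modelled as a plain Set<String>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the actual OpenNLP context generator.
final class CombinedAbbrevFeature {

    private CombinedAbbrevFeature() {
    }

    // True if the token starts with an upper-case character.
    static boolean isFirstUpper(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    // Collects the two individual features and, when both fire,
    // an additional conjoined feature.
    static List<String> collect(Set<String> abbreviations, String prefix, String next) {
        List<String> feats = new ArrayList<>();
        boolean isAbbrev = abbreviations.contains(prefix);
        boolean nextIsCapitalized = isFirstUpper(next);

        if (isAbbrev) {
            feats.add("xabbrev");
        }
        if (nextIsCapitalized) {
            feats.add("ncap"); // assumed name for the next-initial-capital feature
        }
        if (isAbbrev && nextIsCapitalized) {
            feats.add("xabbrev_ncap"); // hypothetical combined feature
        }
        return feats;
    }
}
```

With a dictionary containing "Dr", the call collect(dict, "Dr", "Smith")
yields all three features, while collect(dict, "Dr", "smith") yields only
"xabbrev". Whether the combined feature actually improves the model would
need to be measured.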

