opennlp-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: Should entries in the abbreviation dictionary include '.' ?
Date Mon, 19 Mar 2012 08:42:36 GMT
On 03/16/2012 09:47 PM, william.colen@gmail.com wrote:
> Hi,
>
> Should entries in the abbreviation dictionary include '.' ?
>
> The one included for the unit tests includes it:
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=co
>
> If we include the EOS character not all features are collected properly.
>
> The most important issue is here:
>
>        if (inducedAbbreviations.contains(prefix)) {
>          collectFeats.add("xabbrev");
>        }
>
> If we include the EOS character in the dictionary entries, this feature
> will never be collected.
>
> On the other hand we also have the following:
>
>        if (inducedAbbreviations.contains(previous)) {
>          collectFeats.add("vabbrev");
>        }
>
> This would fail if the previous token is an abbreviation and the
> abbreviation dictionary does not include EOS characters.
>
> I would change the code to pass the EOS character as argument to the
> collectFeatures method. What do you think?

Abbreviations can often be written with or without dots. Maybe we should
add a small utility method that removes all non-letters, and use a
case-insensitive dictionary to match the token. The same method could be
run over the dictionary before it is used.

What do you think?
What happens if there is a comma?
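
The normalization idea above could be sketched roughly as follows. This is
only an illustration: the class AbbreviationNormalizer and its methods are
hypothetical names, not existing OpenNLP code, and it assumes the dictionary
can be handed over as a plain collection of strings.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class AbbreviationNormalizer {

    private AbbreviationNormalizer() {
    }

    // Drops every non-letter code point ('.', ',', digits, whitespace)
    // and lowercases what is left, so "Dr.", "dr" and "DR.," all map to "dr".
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        token.codePoints()
             .filter(Character::isLetter)
             .forEach(sb::appendCodePoint);
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    // Runs the same normalization over the dictionary entries once,
    // before the dictionary is used for lookups.
    static Set<String> normalizeAll(Iterable<String> entries) {
        Set<String> normalized = new HashSet<>();
        for (String entry : entries) {
            String key = normalize(entry);
            if (!key.isEmpty()) {
                normalized.add(key);
            }
        }
        return normalized;
    }

    // Looks up a token in an already normalized dictionary.
    static boolean isAbbreviation(Set<String> normalizedDict, String token) {
        return normalizedDict.contains(normalize(token));
    }
}
```

With this sketch, a token followed by a comma such as "Dr.," is reduced to
"dr" and still matches. The price is that distinct strings collapse onto the
same key (for example "No." and "no"), which is one more reason the lookup
result probably should not be used as a feature on its own.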

Maybe we would get better results if the dictionary feature were also
combined with other features, e.g. the next-initial-capital feature.

Jörn
