opennlp-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: SentenceDetectorME.train API change
Date Thu, 21 Jul 2011 14:38:37 GMT
On 7/21/11 4:13 PM, william.colen@gmail.com wrote:
> Should I just change the parameter order?
>
> For some reason the API was using a Dictionary to represent the abbreviation
> dictionary, but it was never used in the default context generator.
> Initially I was thinking about using this Dictionary implementation, but
> according to DefaultSDContextGenerator an abbreviation dictionary should
> implement Set<String>, and since Dictionary was already implementing
> Iterable<StringList> it can't also implement Set<String>.
>
> Another option would be to remove the new AbbreviationDictionary class and
> try to use Dictionary instead, maybe adding a method "asStringSet()" that
> creates a Set<String> from the Dictionary so we can pass it to the context
> generator.
>
> What do you think?

The Dictionary is similar to the new AbbreviationDictionary, but it
additionally supports entries that consist of multiple tokens.

Do we have multi-token abbreviations? If so, we should use Dictionary.
If not, we could still use it: the tokenizer could have a small utility
method that turns a Dictionary into a Set<String>.
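
A rough sketch of such a utility method, only to illustrate the idea. To keep
it self-contained, each Dictionary entry is stood in for by a List<String>
(one element per token) rather than the real StringList, and the class and
method names are hypothetical. It also assumes that multi-token entries are
simply skipped, which is one possible policy and not a decision:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of OpenNLP.
final class DictionarySetSketch {

    // Each inner list stands in for one Dictionary entry (one element per token).
    static Set<String> asStringSet(Iterable<List<String>> entries) {
        Set<String> result = new HashSet<String>();
        for (List<String> entry : entries) {
            // Assumption: only single-token entries can be looked up as
            // abbreviations, so multi-token entries are skipped here.
            if (entry.size() == 1) {
                result.add(entry.get(0));
            }
        }
        return Collections.unmodifiableSet(result);
    }

    public static void main(String[] args) {
        List<List<String>> entries = Arrays.asList(
                Arrays.asList("Dr."),
                Arrays.asList("e.g."),
                Arrays.asList("New", "York"));

        Set<String> abbreviations = asStringSet(entries);
        System.out.println(abbreviations.contains("Dr."));
        System.out.println(abbreviations.size());
    }
}
```

If multi-token abbreviations turn out to exist, the skipped branch is where
a different policy would go.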

Reusing the Dictionary makes a few things easier, because we do not have
to duplicate its functionality in a second class.

We can also change the DefaultSDContextGenerator, if that is more 
convenient.

Jörn


