opennlp-dev mailing list archives

Subject Re: AlphaNumOpt in tokenizer
Date Thu, 04 Aug 2011 14:52:03 GMT
On Thu, Aug 4, 2011 at 7:58 AM, Jörn Kottmann <> wrote:

> Hi William,
> I saw your change to the alpha num optimization in the
> tokenizer.
> I am aware that it is currently not perfect, especially
> for non-English languages. In my opinion we should use unicode
> to determine what is a letter and what is a numeral.
> Since it is a performance optimization I think we should
> undo the change you made and rather look into the unicode approach.
> What do you think?

+1, but I don't know much about the unicode approach.
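For what it's worth, the unicode approach Jörn mentions could look roughly like this (a sketch only, with a hypothetical helper name, not the actual OpenNLP API): instead of hard-coded ASCII ranges like 'a'-'z' or '0'-'9', classify characters via java.lang.Character, which follows the Unicode character database and so handles non-English letters too:

```java
public class AlphaNumCheck {

    // True if every code point in the token is a Unicode letter or digit.
    // Character.isLetter/isDigit cover non-ASCII cases (e.g. German ß,
    // accented letters, non-Latin scripts) that ASCII range checks miss.
    static boolean isAlphaNumeric(String token) {
        return token.codePoints()
                    .allMatch(cp -> Character.isLetter(cp) || Character.isDigit(cp));
    }

    public static void main(String[] args) {
        System.out.println(isAlphaNumeric("Straße2")); // true: ß is a Unicode letter
        System.out.println(isAlphaNumeric("can't"));   // false: apostrophe is neither
    }
}
```

Whether this is fast enough to keep the optimization worthwhile would need benchmarking, since the point of the alpha-num check is performance.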

> We might want more options anyway, e.g. a tokenization dictionary for
> some frequent cases. In such a dictionary the tokenizer could lookup how
> a certain input char sequence should be tokenized.

Yes. The F score of the models I create using the OpenNLP tokenizer is high
(>99%), but it still fails in some cases, maybe because my training data
doesn't contain enough of these cases.
I added the abbreviation dictionary, but it is not helping that much.
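The tokenization dictionary idea from the quoted message could be sketched like this (hypothetical class and sample entries, not an existing OpenNLP API): frequent hard cases get an explicit tokenization, and the tokenizer consults the dictionary before falling back to the statistical model:

```java
import java.util.HashMap;
import java.util.Map;

public class TokenDictionary {

    // Maps an input char sequence to its fixed tokenization.
    private final Map<String, String[]> entries = new HashMap<>();

    void put(String surface, String... tokens) {
        entries.put(surface, tokens);
    }

    // Returns the dictionary tokenization, or null so the caller
    // falls back to the statistical tokenizer model.
    String[] lookup(String surface) {
        return entries.get(surface);
    }

    public static void main(String[] args) {
        TokenDictionary dict = new TokenDictionary();
        dict.put("don't", "do", "n't");   // sample entry: split contraction
        dict.put("U.S.", "U.S.");         // sample entry: keep as one token
        System.out.println(String.join("|", dict.lookup("don't"))); // do|n't
    }
}
```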
