opennlp-dev mailing list archives

From "william.colen@gmail.com" <william.co...@gmail.com>
Subject Re: AlphaNumOpt in tokenizer
Date Thu, 04 Aug 2011 14:52:03 GMT
On Thu, Aug 4, 2011 at 7:58 AM, Jörn Kottmann <kottmann@gmail.com> wrote:

> Hi William,
>
> I saw your change to the alpha num optimization in the
> tokenizer.
>
> I am aware of the fact that it is not perfect currently, especially
> for non-English languages. In my opinion we should use Unicode
> to determine what is a letter and what is a numeral.
>
> Since it is a performance optimization I think we should
> undo the change you made and rather look into the unicode approach.
>
> What do you think?
>

+1, but I am not familiar with the Unicode approach.
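As a rough illustration of the Unicode idea Jörn mentions (a hypothetical sketch, not the actual OpenNLP implementation): Java's `Character` class already classifies code points by Unicode category, so a letter/digit test can work for non-English text without hard-coded ASCII ranges. The class and method names below are invented for the example.

```java
// Hypothetical sketch: Unicode-aware alpha/num checks using
// java.lang.Character instead of ASCII ranges like 'a'-'z'.
public class AlphaNumCheck {

    /** True if every code point in s is a Unicode letter. */
    static boolean isAlphabetic(String s) {
        if (s.isEmpty()) return false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isLetter(cp)) return false;
            i += Character.charCount(cp); // step over surrogate pairs
        }
        return true;
    }

    /** True if every code point in s is a Unicode decimal digit. */
    static boolean isNumeric(String s) {
        if (s.isEmpty()) return false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isDigit(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAlphabetic("Jörn"));   // true: ö is a Unicode letter
        System.out.println(isAlphabetic("abc123")); // false: contains digits
        System.out.println(isNumeric("42"));        // true
    }
}
```

Since `Character.isLetter` and `Character.isDigit` operate on code points, this also handles accented letters and non-Latin digits that an ASCII check would miss.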


>
> We might want more options anyway, e.g. a tokenization dictionary for
> some frequent cases. In such a dictionary the tokenizer could lookup how
> a certain input char sequence should be tokenized.
>

Yes. The F score of the models I create using the OpenNLP tokenizer is high
(>99%), but it still fails in some cases, maybe because my training data
doesn't contain enough of those cases.
I added the abbreviation dictionary, but it isn't helping much.
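To make the tokenization-dictionary idea concrete (a minimal sketch of what Jörn describes, with invented names; not an existing OpenNLP API): the tokenizer could consult a map from frequent character sequences to a fixed tokenization before falling back to the statistical model.

```java
// Hypothetical sketch: a dictionary mapping frequent input sequences
// to a fixed tokenization, consulted before the statistical tokenizer.
import java.util.HashMap;
import java.util.Map;

public class TokenDictionary {
    private final Map<String, String[]> entries = new HashMap<>();

    /** Register a fixed tokenization for a char sequence. */
    void put(String text, String... tokens) {
        entries.put(text, tokens);
    }

    /** Returns the fixed tokenization, or null if not in the dictionary. */
    String[] lookup(String text) {
        return entries.get(text);
    }

    public static void main(String[] args) {
        TokenDictionary dict = new TokenDictionary();
        // e.g. always split an English contraction the same way
        dict.put("can't", "ca", "n't");

        String[] toks = dict.lookup("can't");
        System.out.println(String.join("|", toks)); // ca|n't
        System.out.println(dict.lookup("unknown")); // null -> fall back to model
    }
}
```

A lookup hit would bypass the model entirely, which fixes exactly the kind of frequent-but-underrepresented cases mentioned above without retraining.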
