lucene-dev mailing list archives

From "Michael McCandless (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters
Date Thu, 29 Mar 2012 16:20:28 GMT


Michael McCandless commented on LUCENE-3907:

Awesome!  We just need a possible mentor here... volunteers...?
> Improve the Edge/NGramTokenizer/Filters
> ---------------------------------------
>                 Key: LUCENE-3907
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
> Our ngram tokenizers/filters could use some love.  E.g., they output ngrams in multiple
> passes, instead of "stacked", which messes up offsets/positions and requires too much buffering
> (can hit OOME for long tokens).  The tokenizers clip input at 1024 chars, but the token filters
> don't.  They also split up surrogate pairs incorrectly.
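
To illustrate the surrogate-pair point in the description above, here is a minimal sketch in plain Java of ngram generation that steps by Unicode code points rather than by char, so a supplementary character is never cut in half. It is not the Lucene implementation or any proposed patch: the class name CodePointNGrams and the method ngrams are hypothetical, and it uses only java.lang.String methods. It emits all grams for one start offset before moving to the next, which loosely mirrors the "stacked" ordering mentioned above, though it does not model Lucene positions or offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class CodePointNGrams {

    // Returns every ngram of minGram..maxGram code points, grouped by start offset.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        // Count code points, not chars, so a surrogate pair counts once.
        int cpCount = text.codePointCount(0, text.length());
        int start = 0; // char index where the current code point begins
        for (int i = 0; i < cpCount; i++) {
            // Emit all gram sizes that fit at this start before advancing.
            for (int n = minGram; n <= maxGram && i + n <= cpCount; n++) {
                // offsetByCodePoints skips n whole code points from start.
                int end = text.offsetByCodePoints(start, n);
                out.add(text.substring(start, end));
            }
            // Advance by one code point (one or two chars).
            start = text.offsetByCodePoints(start, 1);
        }
        return out;
    }
}
```

For example, calling CodePointNGrams.ngrams("a\uD83D\uDE00b", 1, 1) yields three grams, and the middle one is the full two-char surrogate pair rather than two broken halves.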

