lucene-dev mailing list archives

From "Uwe Schindler (Assigned) (JIRA)" <>
Subject [jira] [Assigned] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters
Date Tue, 03 Apr 2012 16:58:23 GMT


Uwe Schindler reassigned LUCENE-3907:

    Assignee: Uwe Schindler

I would like to be the mentor for this. I wanted to fix those a long time ago, and I am happy
that somebody is helping.

P.S.: Maybe we also get a new ShingleMatrix *LOL*
> Improve the Edge/NGramTokenizer/Filters
> ---------------------------------------
>                 Key: LUCENE-3907
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Uwe Schindler
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
> Our ngram tokenizers/filters could use some love.  E.g., they output ngrams in multiple
> passes, instead of "stacked", which messes up offsets/positions and requires too much buffering
> (can hit OOME for long tokens).  They clip at 1024 chars (tokenizers) but don't (token filters).
> They split up surrogate pairs incorrectly.
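
For illustration only, here is a minimal sketch of the two behaviours the issue asks for: emitting all grams for one start position together (one reading of "stacked") and measuring gram length in code points so that a surrogate pair is never cut in half. It uses plain JDK string methods rather than any Lucene API; the class name CodePointNGrams and the method ngrams are hypothetical and are not taken from the Lucene code base or from any patch on this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: a sketch under the assumptions stated above.
public class CodePointNGrams {

    /**
     * Returns every n-gram of text whose length, counted in Unicode code points,
     * lies in [minGram, maxGram]. Grams are grouped by start position, so all
     * sizes for one position are produced before moving to the next position.
     */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        int cpCount = text.codePointCount(0, text.length());
        int start = 0; // char index where the current start code point begins
        for (int i = 0; i < cpCount; i++) {
            for (int n = minGram; n <= maxGram && i + n <= cpCount; n++) {
                // Advance by n code points, not n chars, so pairs stay intact.
                int end = text.offsetByCodePoints(start, n);
                out.add(text.substring(start, end));
            }
            start = text.offsetByCodePoints(start, 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a', U+1D11E (two UTF-16 chars, i.e. one surrogate pair), 'b'
        String s = "a\uD834\uDD1Eb";
        for (String gram : ngrams(s, 1, 2)) {
            System.out.println(gram + " (" + gram.length() + " chars)");
        }
    }
}
```

A single pass over the start positions also means only the current position has to be tracked, which is the buffering concern raised in the description. A real tokenizer would additionally have to set offsets and position increments, which this sketch does not attempt.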

