lucene-dev mailing list archives

From "Reinardus Surya Pradhitya (Commented) (JIRA)"
Subject [jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters
Date Thu, 29 Mar 2012 15:24:28 GMT


Reinardus Surya Pradhitya commented on LUCENE-3907:


I'm interested in this project. I have done a Natural Language Processing project on language
classification, in which I did tokenization using Stanford's NLP tool. I'm also currently working
on an Information Retrieval project on document indexing and searching using Lucene and Weka.
I am not yet very familiar with Lucene's ngram tokenizer, but I have worked with n-grams
and with Lucene before, so I believe I would be able to learn quickly. Thanks :)

Best regards,
Reinardus Surya Pradhitya
> Improve the Edge/NGramTokenizer/Filters
> ---------------------------------------
>                 Key: LUCENE-3907
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.0
> Our ngram tokenizers/filters could use some love. E.g., they output ngrams in multiple
> passes, instead of "stacked", which messes up offsets/positions and requires too much buffering
> (can hit OOME for long tokens). They clip at 1024 chars (tokenizers) but don't (token filters).
> They split up surrogate pairs incorrectly.
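
To make the issue description concrete, here is a minimal sketch in plain Java of what "stacked" output and surrogate-safe splitting could look like. It is not Lucene code and not the change proposed in LUCENE-3907: the class StackedNGramSketch and the method stackedNGrams are hypothetical names, and "stacked" is read here as "emit every gram that starts at a position before moving on to the next position". Gram lengths are counted in Unicode code points rather than Java chars, so a surrogate pair is never cut in half.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; hypothetical class, not part of Lucene.
final class StackedNGramSketch {

    // Returns all n-grams of `text` with lengths minGram..maxGram, measured in
    // code points. Grams are grouped by start position, so one pass over the
    // input is enough and only the current window has to be considered.
    static List<String> stackedNGrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        int codePoints = text.codePointCount(0, text.length());
        int start = 0; // char index of the code point the grams start at
        for (int cp = 0; cp < codePoints; cp++) {
            // Emit every gram length that still fits at this start position.
            for (int n = minGram; n <= maxGram && cp + n <= codePoints; n++) {
                // offsetByCodePoints steps over a surrogate pair as one unit.
                int end = text.offsetByCodePoints(start, n);
                grams.add(text.substring(start, end));
            }
            start = text.offsetByCodePoints(start, 1);
        }
        return grams;
    }
}
```

For example, stackedNGrams("abc", 1, 2) yields a, ab, b, bc, c, whereas a multi-pass approach would produce all 1-grams first and then all 2-grams. A real token filter would additionally have to set offsets and position increments for each gram, which this sketch leaves out.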
