lucene-java-user mailing list archives

From Martin <>
Subject A full-text tokenizer for the NGramTokenFilter
Date Sat, 17 Jul 2010 20:29:23 GMT
Hi there,

I have recently been trying to build a Lucene index out of n-grams and
have run into a number of issues. I first tried the NGramTokenizer, but
it apparently only tokenizes the first 1024 characters of the input.
Searching the web, I found this issue discussed a couple of years ago,
and the solution proposed there was to use the NGramTokenFilter instead.
That filter certainly works, but it needs an underlying tokenizer, and
I am wondering whether there is a tokenizer that would return the whole
text as a single token. The reason I can't use something like the
StandardTokenizer is that the n-grams should include spaces, and pretty
much every tokenizer gets rid of them.
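
To make the desired output concrete, here is a minimal plain-Java sketch
of character n-grams taken over the whole text, with spaces kept. It
uses no Lucene classes at all; the class name FullTextNGrams, the method
ngrams, and the order in which the grams are emitted are assumptions
made for illustration only, not how NGramTokenFilter is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows the n-grams I would
// like to end up with in the index.
final class FullTextNGrams {

    // Returns every substring of length minGram..maxGram, shortest
    // sizes first, without splitting on or dropping whitespace.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Slide a window of width n over the entire input, so the
            // text is not cut off at any fixed number of characters.
            for (int start = 0; start + n <= text.length(); start++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }
}
```

For the input "a b" with minGram 2 and maxGram 3, this produces the
grams "a ", " b" and "a b", so the space survives inside each gram.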

Thank you very much in advance for any suggestions.

