lucene-java-user mailing list archives

From Rob Hasselbaum <...@hasselbaum.net>
Subject NGramTokenFilter filters out small tokens?
Date Thu, 15 Dec 2011 15:13:19 GMT
Hi. I'm trying to configure an analyzer to be somewhat forgiving of
spelling mistakes in longer words of a search query. So, for example, if a
word in the query matches at least five characters of an indexed word
(token), I want that to be a hit. NGramTokenFilter with a minimum gram size
of 5 seems perfect for this. However, I just discovered that any tokens
shorter than 5 characters are being completely filtered out, so queries
containing words of < 5 characters are not matching anything at all. At
first I thought this was a bug, but then I found LUCENE-1491, which
indicates this is actually the intended behavior. Hmmm... How then should I
configure my analyzer to support exact matches on words <= 5 characters and
partial matches on words > 5? I guess I could develop my own token filter
based on NGramTokenFilter, but my requirements seem so basic that I'm
probably missing a simpler answer. Any help greatly appreciated!
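For what it's worth, here is a minimal sketch (plain Java, outside Lucene's API, with a hypothetical `GramExpander` helper) of the splitting behavior I'm after: tokens shorter than the minimum gram size pass through whole, and longer tokens are broken into all substrings of that length, like NGramTokenFilter with minGram == maxGram. A custom TokenFilter would presumably need to implement this same logic in its incrementToken():

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the desired analysis behavior,
// not part of Lucene: short tokens are kept intact for exact matching,
// longer tokens are expanded into fixed-length grams for partial matching.
public class GramExpander {
    public static List<String> expand(String token, int minGram) {
        List<String> out = new ArrayList<String>();
        if (token.length() < minGram) {
            // Token is too short to gram: emit it unchanged so it can
            // still produce an exact match (the behavior the stock
            // NGramTokenFilter does not give me).
            out.add(token);
        } else {
            // Emit every substring of length minGram; a word of exactly
            // minGram characters yields itself, so exact matches on
            // 5-character words still work.
            for (int i = 0; i + minGram <= token.length(); i++) {
                out.add(token.substring(i, i + minGram));
            }
        }
        return out;
    }
}
```

With minGram = 5, "cat" stays "cat", while "search" becomes ["searc", "earch"], so a query word sharing any 5-character run with an indexed word can hit.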
