lucene-java-user mailing list archives

From Yann-Erwan Perio <>
Subject Re: Tokenization and PrefixQuery
Date Fri, 14 Feb 2014 12:11:16 GMT
On Fri, Feb 14, 2014 at 12:33 PM, Michael McCandless
<> wrote:

> This is similar to PathHierarchyTokenizer, I think.

Ah, yes, very much. I'll check it out and see if I can make something
of it. I'm not sure to what extent it will be reusable, though, as my
tokenizer also sets payloads: the upcoming "path part" is stored on
the current token as a payload, so that, at search time, there is a
view of what lies ahead.
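For illustration, here is a minimal stand-alone sketch (plain Java, not the actual tokenizer; class and method names are hypothetical) of the idea: emit one token per hierarchy level, pairing each token with the next path part as its would-be payload.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emit one token per hierarchy level of a
// slash-separated path, pairing each token with the *next* path part
// (the would-be payload), or null on the last level.
public class PathPayloadSketch {
    // Represents an emitted token together with its payload.
    public record TokenWithPayload(String token, String payload) {}

    public static List<TokenWithPayload> tokenize(String path) {
        String[] parts = path.split("/");
        List<TokenWithPayload> out = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) prefix.append('/');
            prefix.append(parts[i]);
            // The payload is the segment that follows this level, if any.
            String payload = (i + 1 < parts.length) ? parts[i + 1] : null;
            out.add(new TokenWithPayload(prefix.toString(), payload));
        }
        return out;
    }

    public static void main(String[] args) {
        for (TokenWithPayload t : tokenize("a/b/c")) {
            System.out.println(t.token() + " -> " + t.payload());
        }
        // a -> b, a/b -> c, a/b/c -> null
    }
}
```

In real Lucene code the payload would be attached via a `PayloadAttribute` on the token stream rather than returned as a pair.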

>> Regarding the PrefixQuery: it seems that it stops matching documents
>> when the length of the searched string exceeds a certain length. Is
>> that the expected behavior, and if so, can I / should I manage this
>> length?
> That should not be the case: it should match all terms with that
> prefix regardless of the term's length.  Try to boil it down to a
> small test case?

I guess I've been too shallow with my testing, then :( Well, I'll dig
deeper, and if I find something wrong with Lucene, I'll post a small
test case demonstrating the issue - but so far, the errors were always
on my side.
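For what it's worth, a conceptual sketch (plain Java, not Lucene's implementation) of why prefix length shouldn't matter: a prefix query effectively walks the sorted term dictionary and keeps every term that starts with the prefix, however long the prefix is.

```java
import java.util.List;
import java.util.TreeSet;

// Conceptual sketch of prefix matching over a sorted term dictionary.
// Not Lucene code: Lucene enumerates terms via the terms index, but the
// matching principle is the same and is independent of prefix length.
public class PrefixMatchSketch {
    public static List<String> match(TreeSet<String> terms, String prefix) {
        // tailSet jumps to the first term >= prefix in sorted order...
        return terms.tailSet(prefix).stream()
                    // ...and we stop once terms no longer share the prefix.
                    .takeWhile(t -> t.startsWith(prefix))
                    .toList();
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
            List.of("a/b", "a/b/c", "a/b/c/d/e/f", "x/y"));
        System.out.println(match(terms, "a/b/c"));
        // [a/b/c, a/b/c/d/e/f]
    }
}
```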

> I think your approach is a typical one (adding more terms to the index
> so you get TermQuery instead of MoreCostlyQuery).  E.g.,
> ShingleFilter, CommonGrams are examples of the same general idea.
> Another example is AnalyzingInfixSuggester, which does the same thing
> you are doing under-the-hood but one byte at a time (i.e. all term
> prefixes up to a certain depth), and it also makes its analysis depth
> controllable.  Maybe expose it to your users as a very expert tunable?

This is what I have done, letting the clients of the framework specify
the analysis depth through their configuration file.
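The depth-limited expansion itself can be sketched in a few lines (a hypothetical illustration, not the actual framework code): every prefix of a term up to the configured depth is indexed as its own term, so prefix searches up to that length become exact TermQuery lookups instead of costlier multi-term queries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of depth-limited prefix expansion, in the spirit
// of what AnalyzingInfixSuggester does one byte at a time. maxDepth is
// the tunable that clients would set in their configuration file.
public class PrefixExpansionSketch {
    public static List<String> expand(String term, int maxDepth) {
        List<String> prefixes = new ArrayList<>();
        int limit = Math.min(maxDepth, term.length());
        for (int i = 1; i <= limit; i++) {
            prefixes.add(term.substring(0, i));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        System.out.println(expand("lucene", 4));
        // [l, lu, luc, luce]
    }
}
```

The trade-off is the usual one: a larger depth means more terms in the index but cheaper queries, which is why exposing it as an expert tunable makes sense.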

Thanks a lot for your feedback, it's very appreciated.
