From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Tokenization and PrefixQuery
Date Fri, 14 Feb 2014 11:33:49 GMT
On Fri, Feb 14, 2014 at 6:17 AM, Yann-Erwan Perio <ye.perio@gmail.com> wrote:
> Hello,
>
> I am designing a system with documents having one field containing
> values such as "Ae1 Br2 Cy8 ...", i.e. a sequence of items made of
> letters and numbers (max=7 per item), all separated by a space,
> possibly 200 items per field, with no limit upon the number of
> documents (although I would not expect more than a few million
> documents). The order of these values is important, and I want to
> search for these, always starting with the first value, and including
> as many following values as needed: for instance, "Ae1", "Ae1 Br2"
> would be possible search values.
>
> At first, I indexed these using a space-delimited analyzer, and ran
> PrefixQueries. I encountered some performance issues though, so ended
> up building my own tokenizer, which would create tokens for all
> starting combinations ("Ae1", "Ae1 Br2"...), up to a certain limit,
> called the analysis depth.

This is similar to PathHierarchyTokenizer, I think.
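
For instance, here's a rough sketch of what PathHierarchyTokenizer
would do with your data, using the 4.x analyzers-common API (the
input string is just an example):

import java.io.StringReader;
import org.apache.lucene.analysis.path.PathHierarchyTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PathHierarchyDemo {
  public static void main(String[] args) throws Exception {
    // Treat the space as the "path" separator, so each emitted token
    // is a prefix of the full sequence:
    //   "Ae1", "Ae1 Br2", "Ae1 Br2 Cy8"
    PathHierarchyTokenizer tok = new PathHierarchyTokenizer(
        new StringReader("Ae1 Br2 Cy8"), ' ', ' ', 0);
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString());
    }
    tok.end();
    tok.close();
  }
}

Note it has no built-in depth limit, though, so you'd still need your
own filter on top if you want to cap the analysis depth.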

> I would then dynamically create TermQueries
> to match these tokens when searching under the analysis depth, and
> PrefixQueries when searching over the analysis depth (the whole string
> also being indexed as a single token). The performance was great,
> because TermQueries are very fast, and PrefixQueries are not bad
> either, when the number of matching documents is small
> (which happens to be the case when searching beyond the analysis
> depth). I have however two questions: one regarding the PrefixQuery,
> and one regarding the general design.
>
> Regarding the PrefixQuery: it seems that it stops matching documents
> when the length of the searched string exceeds a certain length. Is
> that the expected behavior, and if so, can I / should I manage this
> length?

That should not be the case: it should match all terms with that
prefix regardless of the term's length.  Try to boil it down to a
small test case?
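
E.g. something standalone along these lines (4.x API; the field name
and values are made up) should print 1 hit no matter how long the
prefix is:

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PrefixQueryLengthTest {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_46, new WhitespaceAnalyzer(Version.LUCENE_46)));

    // Index the whole sequence as one un-analyzed token:
    Document doc = new Document();
    doc.add(new StringField("seq", "Ae1 Br2 Cy8 Dz4 Ee5 Ff6", Field.Store.NO));
    w.addDocument(doc);
    w.close();

    DirectoryReader r = DirectoryReader.open(dir);
    IndexSearcher s = new IndexSearcher(r);

    // A prefix longer than any "analysis depth" should still match:
    TopDocs hits = s.search(
        new PrefixQuery(new Term("seq", "Ae1 Br2 Cy8 Dz4")), 10);
    System.out.println("hits=" + hits.totalHits);  // expect 1
    r.close();
    dir.close();
  }
}

If a test like that passes but your real index doesn't, then the
analyzer is probably splitting the term at index time.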

> Regarding the general design: I have adopted a hybrid
> TermQuery/PrefixQuery approach, letting clients customize the analysis
> depth, so as to keep a balance between the performance and the size of
> the index. I am however not sure this is a good idea: would it be
> better to tokenize the full string (i.e. analysis depth is infinity,
> so as to only use TermQueries)? Or could my design be substituted by
> an altogether different, more successful analysis approach?

I think your approach is a typical one (adding more terms to the index
so you get TermQuery instead of MoreCostlyQuery).  E.g.,
ShingleFilter, CommonGrams are examples of the same general idea.
Another example is AnalyzingInfixSuggester, which does the same thing
you are doing under the hood, but one byte at a time (i.e. all term
prefixes up to a certain depth), and it also makes its analysis depth
controllable.  Maybe expose it to your users as a very expert tunable?
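
For illustration, the query-side dispatch could be as simple as this
(the field name and analysisDepth are placeholders for whatever you
actually use):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SequenceQueryFactory {
  private final String field;
  private final int analysisDepth; // max items indexed as whole prefix tokens

  public SequenceQueryFactory(String field, int analysisDepth) {
    this.field = field;
    this.analysisDepth = analysisDepth;
  }

  /** queryText is e.g. "Ae1 Br2"; items are space-separated. */
  public Query make(String queryText) {
    int items = queryText.split(" ").length;
    if (items <= analysisDepth) {
      // This exact prefix was indexed as its own token: cheap TermQuery.
      return new TermQuery(new Term(field, queryText));
    }
    // Beyond the analysis depth, fall back to a PrefixQuery against the
    // whole-sequence token; few documents share such a long prefix, so
    // it stays fast.
    return new PrefixQuery(new Term(field, queryText));
  }
}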

Mike McCandless

http://blog.mikemccandless.com


