lucene-java-user mailing list archives

From Adrien Grand <>
Subject Re: Is there some sensible way to do giant BooleanQuery or similar lazily?
Date Mon, 03 Apr 2017 08:25:16 GMT
Large boolean queries can cause a lot of random access as each sub-clause
is advanced one after the other. Even in the case that everything fits in
the filesystem cache, the fact that the heap needs to be rebalanced after
each document makes queries with many clauses slow. This is why we have
TermInSetQuery (TermsQuery on 6.x): it has a more disk-friendly access
pattern (1 seek per term per segment) and scales better with the number of
terms. Unfortunately it does not come without drawbacks: the main one is
that it is always evaluated against the entire index. So if you intersect
a very selective query (on an id field for instance) with a large
TermInSetQuery, the TermInSetQuery will dominate the execution time.
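
A minimal sketch of that trade-off against the Lucene 7.x API (the field
name, the id value, and the helper names are illustrative, not from this
thread; on 6.x, org.apache.lucene.queries.TermsQuery plays the same role
as TermInSetQuery):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

public class TermFileQueries {
  /** One query for the whole term list: 1 seek per term per segment,
   *  rather than one BooleanQuery clause per term. */
  static Query manyTerms(String field, List<String> termLines) {
    List<BytesRef> terms = new ArrayList<>();
    for (String line : termLines) {
      terms.add(new BytesRef(line));
    }
    return new TermInSetQuery(field, terms);
  }

  /** The drawback described above: even when intersected with a very
   *  selective query, the TermInSetQuery is still evaluated against
   *  the entire index. */
  static Query intersectWithSelective(Query manyTerms) {
    Query selective = new TermQuery(new Term("id", "doc-42"));
    return new BooleanQuery.Builder()
        .add(selective, Occur.MUST)
        .add(manyTerms, Occur.FILTER)
        .build();
  }
}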

On Mon, Apr 3, 2017 at 03:18, Trejkaz <> wrote:

> Hi all.
> We have this one kind of query where you essentially specify a text
> file which contains the actual query to search for. The catch is that
> the text file can be large.
> Our custom query currently computes the set of matching docs up front,
> and then when queries come in for one LeafReader, the larger doc ID
> set is sliced so that the sub-slice for that leaf is returned (a
> sketch of this pattern follows after this message). This is
> confusing, and seems backwards.
> As an alternative, we could override rewrite(IndexReader) and return a
> gigantic boolean query. Problems being:
>   1) A gigantic BooleanQuery takes up a lot more memory than a list of
> query strings.
>   2) Lucene devs often say that gigantic boolean queries are bad,
> maybe for reason #1, or maybe for another reason that nobody
> understands.
> So in place of this, is there some kind of alternative?
> For instance, is there some query type where I can provide an iterator
> of sub-queries, so that they don't all have to be in memory at once?
> The code to get each sub-query is always relatively straightforward
> and easy to understand.
> I guess the snag is that sometimes the line of text is natural
> language which gets run through an analyser, so we'd potentially be
> re-analysing the text once per leaf reader? :/
> This would replace about 1/3 of the remaining places where we have to
> compute the doc ID set up-front.
> TX
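
A rough sketch of the per-leaf slicing pattern described in the message
above, against the Lucene 6.x Query/Weight API. The class name and the
precomputed FixedBitSet of top-level doc IDs are assumptions for
illustration, not the poster's actual code:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ConstantScoreScorer;
import org.apache.lucene.search.ConstantScoreWeight;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.FixedBitSet;

// Hypothetical name; globalMatches is the doc ID set computed up front
// against the top-level (composite) reader.
public final class PrecomputedDocsQuery extends Query {
  private final FixedBitSet globalMatches;

  public PrecomputedDocsQuery(FixedBitSet globalMatches) {
    this.globalMatches = globalMatches;
  }

  @Override
  public Weight createWeight(IndexSearcher searcher, boolean needsScores)
      throws IOException {
    return new ConstantScoreWeight(this) {
      @Override
      public Scorer scorer(LeafReaderContext ctx) throws IOException {
        // Slice the global set down to this leaf: top-level doc IDs in
        // [docBase, docBase + maxDoc) map to leaf-local IDs by subtraction.
        int docBase = ctx.docBase;
        int maxDoc = ctx.reader().maxDoc();
        FixedBitSet leafBits = new FixedBitSet(maxDoc);
        int upto = Math.min(docBase + maxDoc, globalMatches.length());
        for (int global = docBase; global < upto; ) {
          global = globalMatches.nextSetBit(global);
          if (global == DocIdSetIterator.NO_MORE_DOCS || global >= upto) {
            break;
          }
          leafBits.set(global - docBase);
          global++;
        }
        return new ConstantScoreScorer(this, score(),
            new BitSetIterator(leafBits, leafBits.cardinality()));
      }
    };
  }

  @Override
  public String toString(String field) {
    return "PrecomputedDocsQuery";
  }

  @Override
  public boolean equals(Object other) {
    return sameClassAs(other)
        && globalMatches == ((PrecomputedDocsQuery) other).globalMatches;
  }

  @Override
  public int hashCode() {
    return 31 * classHash() + System.identityHashCode(globalMatches);
  }
}

Subtracting ctx.docBase is what maps top-level doc IDs to leaf-local
ones; computing globalMatches eagerly against the whole reader is
exactly the up-front step the message is trying to avoid.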
