lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: Payloads and TrieRangeQuery
Date Wed, 10 Jun 2009 19:35:55 GMT
On Wed, Jun 10, 2009 at 3:07 PM, Uwe Schindler<> wrote:

>> I wonder how performance would compare.  Without payloads, there are
>> many more terms (for the tiny ranges) in the index, and your OR query
>> will have lots of these tiny terms.  But then these tiny terms don't
>> hit many docs, and with BooleanScorer (which we should switch to for
>> OR queries) ought not be very costly.
> That is true. The main idea was also to limit seeking during the query.
> When splitting the range, you often need to start new TermEnums and iterate
> over a lot of terms. By catching many docs with fewer terms, you only need to
> scan forward in the payloads.

OK, though we should separately test "cold" searches (seeking matters)
and "hot" searches (seeking doesn't).  And we should separately test
SSD vs spinning drive for the cold case.  Seeking is much less costly
(though still more costly than "hot" searches) with SSDs...

>> Vs w/ payloads having to use
>> TermPositions, having to load, decode & check the payload, and I guess
>> assuming on average that 1/2 the docs are filtered out.
> Maybe decoding the payload is not needed, I would encode the bounds as
> byte[] and compare the arrays. But you would filter about half of the docs
> out.

Yonik's idea (encoding in the position) seems great here.
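
For reference, the byte[] comparison suggested in the quoted text could look
roughly like the sketch below. This is not code from the TrieRange contrib:
the class and method names are hypothetical, and the encoding (8 big-endian
bytes with the sign bit flipped, so that unsigned byte order matches numeric
order) is an assumption made only so that the arrays can be compared without
decoding them back to a long.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene.
final class PayloadBounds {

    // Assumed encoding: flip the sign bit, then write 8 big-endian bytes
    // (ByteBuffer is big-endian by default). Unsigned lexicographic order
    // of the resulting arrays then equals numeric order of the longs.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(8).putLong(value ^ Long.MIN_VALUE).array();
    }

    // Lexicographic comparison treating each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    // True if the payload lies within [lower, upper], both inclusive.
    static boolean inRange(byte[] payload, byte[] lower, byte[] upper) {
        return compareUnsigned(payload, lower) >= 0
            && compareUnsigned(payload, upper) <= 0;
    }
}
```

The query bounds would be encoded once per query, so the per-document work
is at most two array comparisons.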

> My problem with all this is how to optimize after which shift value to
> switch between terms and payloads.

Presumably you'd "roughly" balance seek time vs "wasted doc filtered
out" time, to set the default, and make it configurable.
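
A very rough sketch of what such a balance could look like, under a made-up
linear cost model. Everything here is an assumption for illustration: the
class name, the parameters (seekCost, filterCost, docsPerValue), the
worst-case term count of 2 * (2^precisionStep - 1) per remaining trie level,
and the guess that the two boundary terms at the switch shift each cover
2^shift values of which about half are filtered out. Real numbers would have
to come from the cold/hot benchmarks discussed above.

```java
// Hypothetical cost model, not part of Lucene.
final class ShiftChooser {

    // Estimated cost of switching from terms to payloads at the given shift.
    // shift == 0 means "no payloads": all trie levels are resolved with terms.
    static double cost(int bits, int precisionStep, int shift,
                       double seekCost, double filterCost, double docsPerValue) {
        // Trie levels still resolved purely through terms.
        long levels = (bits - shift) / precisionStep;
        // Assumed worst case per level, plus two payload-carrying boundary terms.
        long terms = 2L * ((1L << precisionStep) - 1) * levels + (shift > 0 ? 2 : 0);
        // Two boundary terms, 2^shift values each, about half filtered out.
        double wastedDocs = shift > 0 ? Math.pow(2, shift) * docsPerValue : 0.0;
        return seekCost * terms + filterCost * wastedDocs;
    }

    // Picks the candidate shift (a multiple of precisionStep) with the lowest cost.
    static int chooseShift(int bits, int precisionStep,
                           double seekCost, double filterCost, double docsPerValue) {
        int bestShift = 0;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int shift = 0; shift < bits; shift += precisionStep) {
            double c = cost(bits, precisionStep, shift, seekCost, filterCost, docsPerValue);
            if (c < bestCost) {
                bestCost = c;
                bestShift = shift;
            }
        }
        return bestShift;
    }
}
```

In this model, a higher seekCost (cold index, spinning drive) pushes the
switch point toward larger shifts, while a seekCost near zero (hot index)
pushes it back toward plain terms.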

> And this information about the trie
> structure and where payloads are should be stored in FieldInfos.
> As we now search on each segment separately, this information can be stored
> per segment and also used for each per-segment Filter/Scorer.

Right, I think it should, but I agree w/ Yonik (partially) that it's orthogonal.

> The whole thing works out of the box with TrieRangeFilter (it's just
> iterating over terms, getting TermDocs/TermPositions and setting bits, when
> payloads are available, after checking these); for TrieRangeQuery using
> BooleanQuery it is more complicated (MTQ cannot simply add the terms from
> the FilteredTermEnum to a BooleanQuery).

Seems like we should generalize MTQ so that the subclass could return
which clause should be added for each term, to the BQ?  (We are also
still needing to improve MTQ to decouple constant-scoring from "use BQ
or filter"... there's an issue opened for that).
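
To make the idea concrete, here is a minimal sketch of such a hook. It
deliberately uses stand-in types rather than Lucene's classes: Clause,
MultiTermRewriter, TrieRewriter and clauseFor are all hypothetical names,
and the real change to MTQ might look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Stand-in for a query clause; not a Lucene type.
interface Clause {
    String describe();
}

// Stand-in for the MTQ rewrite step that ORs together one clause per term.
abstract class MultiTermRewriter {

    // Hook: by default each enumerated term becomes a plain term clause.
    protected Clause clauseFor(String term) {
        return () -> "term(" + term + ")";
    }

    // Collects one clause per enumerated term, in enumeration order.
    final List<Clause> rewrite(Iterable<String> terms) {
        List<Clause> clauses = new ArrayList<>();
        for (String term : terms) {
            clauses.add(clauseFor(term));
        }
        return clauses;
    }
}

// A trie-style subclass: boundary terms get a payload-checking clause,
// all other terms keep the default clause.
final class TrieRewriter extends MultiTermRewriter {
    private final Set<String> boundaryTerms;

    TrieRewriter(Set<String> boundaryTerms) {
        this.boundaryTerms = boundaryTerms;
    }

    @Override
    protected Clause clauseFor(String term) {
        if (!boundaryTerms.contains(term)) {
            return super.clauseFor(term);
        }
        return () -> "payloadChecked(" + term + ")";
    }
}
```

The base class keeps control of how the clauses are combined, and the
subclass only decides what each term turns into.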

> Until now I had no time to think about it in detail, but with maybe the
> possibility to have TrieRange in Core and store trie-specific FieldInfos per
> segment, I will get clearer how to manage this in the API.

I'd really like to see TrieRange in core for 2.9...
