lucene-dev mailing list archives

From Sebastiano Vigna <vi...@di.unimi.it>
Subject Re: Interleaving and new Lucene formats
Date Sat, 16 Feb 2013 13:19:11 GMT
On 16 February 2013 13:19, Robert Muir <rcmuir@gmail.com> wrote:

> I think you are missing my point: this interleaving is part of the
> whole design of this postings format. You can't just turn it off and
> force it to be always FOR: or you would need a new postings format
>

I never asked for that. It looks like you're entirely missing my point,
which is to run a fair benchmark between radically different implementations
of an index structure.


> Thats right. Also keep in mind: in the FOR case the blocks themselves
> are interleaved, so you have a block of 128 doc deltas, then a block
> of 128 freqs follow, then 128 doc deltas again, then 128 freqs.
> finally the vint remainder is docs+freqs interleaved as vints.
>

OK, we are slowly getting there.

So the question is: do you decode interleaved freqs blocks *always*, or do
you do it *lazily* when freqs are actually used?
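
A toy sketch of the distinction being asked about, written in Python rather than taken from Lucene. Only the block size of 128 and the docs/freqs/tail ordering come from the description quoted above; the names `write_postings` and `LazyReader`, and the use of plain lists in place of real FOR/vint encoding, are assumptions made for illustration.

```python
# Toy model, not Lucene's code. BLOCK = 128 and the layout order follow the
# quoted description; all names and the list-based "encoding" are hypothetical.
BLOCK = 128


def write_postings(doc_ids, freqs):
    """Lay out postings as [doc-delta block][freq block]... plus an interleaved tail."""
    deltas = [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]
    full = len(deltas) // BLOCK * BLOCK
    stream = []
    for i in range(0, full, BLOCK):
        stream.append(("docs", deltas[i:i + BLOCK]))
        stream.append(("freqs", freqs[i:i + BLOCK]))
    tail = []
    for d, f in zip(deltas[full:], freqs[full:]):
        tail.extend([d, f])  # remainder: doc delta and freq interleaved
    stream.append(("tail", tail))
    return stream


class LazyReader:
    """Decodes a freq block only when the caller asks for freqs."""

    def __init__(self, stream):
        self.stream = stream
        self.freq_blocks_decoded = 0

    def postings(self, need_freqs):
        doc = 0
        it = iter(self.stream)
        for kind, payload in it:
            if kind == "docs":
                _, freq_payload = next(it)  # a freq block always follows a doc block
                if need_freqs:
                    self.freq_blocks_decoded += 1
                    block_freqs = list(freq_payload)  # stands in for real decoding
                else:
                    block_freqs = [None] * len(payload)  # skipped, never decoded
                for delta, f in zip(payload, block_freqs):
                    doc += delta
                    yield doc, f
            else:  # interleaved tail
                for j in range(0, len(payload), 2):
                    doc += payload[j]
                    yield doc, (payload[j + 1] if need_freqs else None)
```

An "always" reader would correspond to calling `postings(need_freqs=True)` unconditionally; the counter `freq_blocks_decoded` makes the difference between the two policies observable.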

My only concern, again, is to do a fair benchmark against a non-interleaved
index. Which means that I have to force freqs reading on the
non-interleaved index, or I would penalize Lucene's structure. On the other
hand, if Lucene does not decode freqs when this is not necessary, I would
penalize the non-interleaved index by forcing freqs reads (I am using
a NullCollector to avoid any kind of scoring).
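
The benchmark setup described here can be sketched roughly as follows. This is a hypothetical Python harness, not Lucene code: the `NullCollector` class below only mirrors the idea of a collector that does no scoring, and `run` with its `force_freqs` flag is an assumed name for the switch between "freqs never touched" and "freqs always read".

```python
# Hypothetical harness, not Lucene's API: illustrates forcing (or not forcing)
# frequency reads while collecting hits without any scoring.
class NullCollector:
    """Counts hits and does nothing else, so scoring cost is not measured."""

    def __init__(self):
        self.hits = 0

    def collect(self, doc):
        self.hits += 1


def run(postings, collector, force_freqs):
    """Iterate postings given as (doc, read_freq) pairs.

    read_freq is a zero-argument callable that decodes the frequency on
    demand; it is invoked only when force_freqs is True.
    """
    checksum = 0  # consume the values so the reads cannot be optimized away
    for doc, read_freq in postings:
        if force_freqs:
            checksum += read_freq()
        collector.collect(doc)
    return checksum
```

For a fair comparison, both index structures would be run with the same `force_freqs` setting, so that neither pays for frequency decoding that the other skips.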

If both indices do not read freqs when this is not necessary, then I'm also
interested in forcing a count read to test the speed of access to counts
during conjunctive queries.
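
As an illustration of what "forcing a count read during a conjunctive query" could look like, here is a minimal Python sketch. It uses a plain two-list merge intersection rather than any skipping strategy a real engine would use, and the function name `intersect` and the `read_counts` flag are assumptions, not part of Lucene or of any index discussed in this thread.

```python
# Minimal sketch: AND of two sorted postings lists, with counts read only at
# matching documents. Plain merge intersection is assumed; real engines skip.
def intersect(a, b, read_counts=False):
    """a, b: lists of (doc, count) sorted by doc.

    Returns (matching docs, sum of counts read at the matches).
    """
    i = j = 0
    docs, total = [], 0
    while i < len(a) and j < len(b):
        da, db = a[i][0], b[j][0]
        if da == db:
            docs.append(da)
            if read_counts:
                total += a[i][1] + b[j][1]  # the forced count access
            i += 1
            j += 1
        elif da < db:
            i += 1
        else:
            j += 1
    return docs, total
```

Timing the same query with `read_counts` off and on would isolate the cost of count access at matching documents from the cost of the intersection itself.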

The alternative is to test only phrasal and span queries, where
everybody has to read the same data anyway, but I think this would fail to
expose some aspects, because those queries involve a lot of CPU work and
sequential position reading.
