Thanks for the thorough analysis Jake.
This makes intuitive sense. Early on, there are many insertions to
the PQ, but over time, it "converges" and the insertions eventually
become rare.
So, the total number of insertions that must be handled (= the
subleading cost) will necessarily be higher for the multiPQ case,
because each PQ in the multiPQ case must separately converge per
segment. The singlePQ converges faster.
But that cost becomes nearly a constant once the number of hits gets
high enough (ie, insertions become rare), and so is a vanishing
contribution vs the overall leading S*H cost.
With a low number of hits, both single and multi PQ are handling many
insertions. Going to a higher number of hits, single PQ has
converged but multi PQ hasn't and is thus more costly. Going to an
even higher number of hits, both have converged and the insertion cost
of either becomes "in the noise".
I'm testing on 5M doc indexes (wikipedia & random) now to test queries
with more hits.
Also, I think the constant factors for singlePQ insertion are higher,
because it has to manage indirection (slots) and so must use a
separate class to track that (Entry), plus it makes 4 method calls in
the insertion case (compareBottom, copy, updateBottom, setBottom).
Mike
On Mon, Oct 26, 2009 at 1:14 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
>> Also, I really do not like the large perf hits at highish topN sizes.
>> Admittedly it's unusual (though I suspect not THAT rare) to use such
>> a large topN, but, I esepically don't like that it doesn't degrade
>> gracefully – it's surprising. Likewise I'd like to see with highish
>> number of segments how things degrade.
>
> Well the degradation should not be surprising  it's coded into the
> subleading complexity: redoing the complexity argument carefully
> for the average case, probabilistically, the flop count for single PQ,
> for top T docs out of S segments of H hits in each, is^(footnote1)
>
> O(S*H + T*log(T)*log(S*H))
>
> (note that it is a *sum* of the two terms)
>
> while inserting in S different queues is^(footnote2)
>
> O(S*H + T*log(T)*log(H)*S )
>
> This means that multiPQ has a subleading term with
> S*log(H) / log(S*H) worse performance, as a function of T.
>
> To see whether this is "bad", we have to consider the full
> range of S = numSegments, H = numHitsPerSegment
> and T = topNumHitsToReturn.
>
> If you fix S and H (which fixes the dominating factor as
> constant also), then as T increases, as the tests show,
> multiPQ diverges, getting worse as T goes up.
>
> On the other hand, if you fix T, and instead let S and H
> vary, then the flopcount ratio is the following:
>
> flops_multiPQ / flops_singlePQ =
>
> S*H + T*log(T)*log(S*H)
>  =
> S*H + T*log(T)*log(H)*S
>
> 1 + T*log(T)*log(S*H)/S*H
>  =
> 1 + T*log(T)*log(H)*S/S*H
>
> under the always true (for us) case that S*H is the
> dominating factor, and using 1/(1+x) =~ 1x, for small
> x, we have:
>
> (1 + T*log(T)*log(S*H)/S*H ) * (1  T*log(T)*log(H)/H )
>
> using (1+x)*(1y) = 1+xy for small x, y, and pulling out
> the common factor of T*log(T)/H, we have:
>
> 1  ( T*log(T)/H ) * ( log(H)  (log(S*H) / S) )
>
> What's the end result? If you fix the queue size (T),
> then as the number of hits and the number of hits
> per segment grows, the percentage performance difference
> should go to *zero* proportional to 1/H =
> 1/numHitsPerSegment.
>
> This to me is pretty important to note, and should be
> pretty easy to see in a test run: doing it on a run with
> 5 million documents hit by the query should work out
> much better for the multiPQ than singlePQ, whereas
> *decreasing* the hit count, to say 10,000 hits, should
> penalize multiPQ versus singlePQ.
>
> The average user I would imagine doesn't have *tons*
> of variation in queuesize or number of segments
> (meaning I doubt either grows often above 100, and
> very rarely up to 1000), but the variation in total
> number of hits is extremely variable, and as even
> recent discussions on javauser show, people even
> look at multiTB indexes, so performance as a function
> of index size is probably the most important.
>
> I can run these numbers on my mac desktop, because
> this analysis is independent of hardware or even
> java version, since it's a scaling argument, but we should
> really see what the results are like on reasonable
> hardware for both the small hitcount case (where the
> performance might be really bad for singlePQ, and
> rule it out altogether), and the very high hitcount case
> (where the difference should be much more in favor
> of multiPQ, and in fact could rule out singlePQ if the
> constant factors (like the cost of the more string
> compares that singlePQ has to do) become more
> important than the complexity, because the asymptotics
> are the same in this regime).
>
> jake
>
> 
>
> (1) To see this scaling, if you have a queue of size
> T, and you have currently looked at X hits, then the
> probability in the average case of the next hit beating
> the bottom element of the queue is T/X. Then the
> total number of insertions in the queue you have to
> do in a set of H hits is sum_X=1...H [ T/X ]. Replacing
> the sum with an integral, it's just T log(H). The
> cost of an insertion into a PQ of size T is log(T), giving
> the flop count based on insertions of log(T) * T * log(H)
> for insertions while walking H hits, and log(T)*T*log(S*H)
> for insertions while walking S*H hits.
>
> (2) we're neglecting the final merge for the multiPQ case,
> because its subsubleading in complexity: merging
> S sorted lists of T elements is O(T log(S)) on average,
> which is smaller than the next bigger term in this counting
> by a multiplicative factor of log(T) * log(H).
>

To unsubscribe, email: javadevunsubscribe@lucene.apache.org
For additional commands, email: javadevhelp@lucene.apache.org
