Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of john.wang@gmail.com designates
 209.85.217.219 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type;
        b=qx+13R6JY14kl5e4iVxZn0yGUkkjH7oIaj9x4TT+ydO4LJ2yz1T6T5R+LeV/AK/raI
         km9Zjpa8oy38d8LVRJjhghz4uJPhgRZ68ceq14SfljAGcsQmzc4ZJ11hLLWbWQSLdZNj
         XBhOTvrDU7SsG0XbwcEeIrKKe2bf27BBZA9Lo=
MIME-Version: 1.0
In-Reply-To: <c68e39170910222030o162366bcwfbcdc2597199f21d@mail.gmail.com>
References: <8837fb770910142012w5f11ba57y99aeb18493603813@mail.gmail.com>
	 <9ac0c6aa0910210311nffa18c4gd3acbf0e73ef7b58@mail.gmail.com>
	 <8837fb770910212317g5a8c09ecid1810f301087583c@mail.gmail.com>
	 <9ac0c6aa0910220238l5cdce87g314ea18cd85bd695@mail.gmail.com>
	 <8837fb770910221837p287db022i6ed18349977381c6@mail.gmail.com>
	 <4AE114FA.7050309@gmail.com>
	 <8837fb770910221935t6e744eedi2cc129f5df17d508@mail.gmail.com>
	 <c68e39170910222006pcf28e99w4f88c86ceeae2e4d@mail.gmail.com>
	 <4b124c310910222011x10d03a7v608a170340880c75@mail.gmail.com>
	 <c68e39170910222030o162366bcwfbcdc2597199f21d@mail.gmail.com>
Date: Thu, 22 Oct 2009 22:08:20 -0700
Message-ID: <8837fb770910222208y50704ae9kfe225e1a5a9d287f@mail.gmail.com>
Subject: Re: lucene 2.9 sorting algorithm
From: John Wang <john.wang@gmail.com>
To: java-dev@lucene.apache.org, yonik@lucidimagination.com
Content-Type: multipart/alternative; boundary=001636b2b5255c5da904769333f7

--001636b2b5255c5da904769333f7
Content-Type: text/plain; charset=ISO-8859-1

Hi Yonik:
    I have been head deep in this trying to find out a good solution for
better part of the past two days, it's been hard because there are so many
variables:

1) how optimized are the code from either of the implementations
2) VM difference
3) HW etc.

    Also, there are quite a few dimensions this issue is being discussed on:

Algorithm:

    I think we should NOT jump to the conclusion that my number on the
multiQ is valid until others reproduce it (which is one of the reason I
asked mike to run his benchmark again with 1.5) I am gonna try to run it on
server machines when I get back to my office next week.

    Overall, I think the single Q algorithm is better. (It however does pay
a price for some string compares etc.), Its benefit becomes more and more
significant when the product of PQ size and segment count increases, which
makes complete sense from the algorithm. However, when PQ size is small
(which is in most of the cases, the multiplier on the segment count is also
small) the benefit is not as obvious. And sometimes the trade-off for the
constant string compare cost may not be worth it. (this remains a
hypothesis)

    With Java 1.6, maybe the singleQ approach is a winner in all cases.

     I will spend more time to find out a more definitive answer.

API:

    The new FieldComparator API is not difficult to understand (especially
for Lucene experts such as yourselves), but it is more involved in
comparison to the ScoreDocComparator API. I think anyone would agree with
that. Furthermore, when implementing some custom comparators, (examples I
have given earlier in this thread), it can be difficult to implement while
maintaining performance.

    I understand changing API is hard, that is why I am trying to raise this
as soon as possible, and it could very well be that the current API is fine.

    Lucene's collector api allows anyone to plugin any sorting algorithm,
kinda like what Mike has done with the tests. So it is ok if an API selected
does not fit the needs for everyone.

    In conclusion, please understand I am not trying to be "right" on this,
just trying to learn and to understand, which I did from reading and trying
to understand the code, along with guidance from Mike and Yonik and I am
more than impressed with the thoughts and code tuning that went into it.

Thanks

-John

On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley <yonik@lucidimagination.com>wrote:

> On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
> > It's hard to read the column format, but if you look up above in the
> thread
> > from tonight,
> > you can see that yes, for PQ sizes less than 100 elements, multiPQ is
> > better, and only
> > starts to be worse at around 100 for strings, and 50 for ints.
>
> Ah, OK, I had missed John's followup with the numbers.
>
> I assume this is for Java5 + optimizations?
> What does Java6 show?
>
> bq. Point 2 does indeed make a difference, we've seen it, and it's
> only fair: the single pq comparator does this branch optimization but
> the current patch multi-pq does not, so let's level the playing field.
>
> Of course - it's not about leveling the playing field, but finding the
> best solution for the average case - so everything should be optimized
> as much as possible.  There are probably further optimizations
> possible in both the single and multi PQ cases.
>
> My biggest reservation is that we've gone down the road of telling
> people to implement a new style of comparators, and told them that the
> old style comparators would be deleted in the next release (which is
> where we are).  Reversing that will be a bit of a headache/question...
> the new stuff isn't deprecated, and having *both* isn't desirable, but
> that's a separate decision to be made apart from performance testing.
>
> Is there also an option of using a multiPQ approach with the new style
> comparators?
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

--001636b2b5255c5da904769333f7
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi Yonik:<div><br></div><div>=A0=A0 =A0I have been head deep in this trying=
 to find out a good solution for better part of the past two days, it&#39;s=
 been hard because there are so many variables:</div><div><br></div><div>1)=
 how optimized are the code from either of the implementations</div>
<div>2) VM difference</div><div>3) HW etc.</div><div><br></div><div>=A0=A0 =
=A0Also, there are quite a few dimensions this issue is being discussed on:=
</div><div><br></div><div>Algorithm:</div><div><br></div><div>=A0=A0 =A0I t=
hink we should NOT jump to the conclusion that my number on the multiQ is v=
alid until others reproduce it (which is one of the reason I asked mike to =
run his benchmark again with 1.5) I am gonna try to run it on server machin=
es when I get back to my office next week.</div>
<div><br></div><div>=A0=A0 =A0Overall, I think the single Q algorithm is be=
tter. (It however does pay a price for some string compares etc.), Its bene=
fit becomes more and more significant when the product of PQ size and segme=
nt count increases, which makes complete sense from the algorithm. However,=
 when PQ size is small (which is in most of the cases, the multiplier on th=
e segment count is also small) the benefit is not as obvious. And sometimes=
 the trade-off for the constant string compare cost may not be worth it. (t=
his remains a hypothesis)</div>
<div><br></div><div>=A0=A0 =A0With Java 1.6, maybe the singleQ approach is =
a winner in all cases.</div><div><br></div><div>=A0=A0 =A0 I will spend mor=
e time to find out a more definitive answer.</div><div><br></div><div>API:<=
/div><div>
<br></div><div>=A0=A0 =A0The new FieldComparator API is not difficult to un=
derstand (especially for Lucene experts such as yourselves), but it is more=
 involved in comparison to the ScoreDocComparator API. I think anyone would=
 agree with that. Furthermore, when implementing some custom comparators, (=
examples I have given earlier in this thread), it can be difficult to imple=
ment while maintaining performance. =A0 =A0</div>
<div><br></div><div>=A0=A0 =A0I understand changing API is hard, that is wh=
y I am trying to raise this as soon as possible, and it could very well be =
that the current API is fine.</div><div><br></div><div>=A0=A0 =A0Lucene&#39=
;s collector api allows anyone to plugin any sorting algorithm, kinda like =
what Mike has done with the tests. So it is ok if an API selected does not =
fit the needs for everyone.</div>
<div><br></div><div>=A0=A0 =A0In conclusion, please understand I am not try=
ing to be &quot;right&quot; on this, just trying to learn and to understand=
, which I did from reading and trying to understand the code, along with gu=
idance from Mike and Yonik and I am more than impressed with the thoughts a=
nd code tuning that went into it.</div>
<div><br></div><div>Thanks</div><div><br></div><div>-John</div><div><br><di=
v class=3D"gmail_quote">On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley <span=
 dir=3D"ltr">&lt;<a href=3D"mailto:yonik@lucidimagination.com">yonik@lucidi=
magination.com</a>&gt;</span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex;"><div class=3D"im">On Thu, Oct 22, 2009 at 1=
1:11 PM, Jake Mannix &lt;<a href=3D"mailto:jake.mannix@gmail.com">jake.mann=
ix@gmail.com</a>&gt; wrote:<br>

&gt; It&#39;s hard to read the column format, but if you look up above in t=
he thread<br>
&gt; from tonight,<br>
&gt; you can see that yes, for PQ sizes less than 100 elements, multiPQ is<=
br>
&gt; better, and only<br>
&gt; starts to be worse at around 100 for strings, and 50 for ints.<br>
<br>
</div>Ah, OK, I had missed John&#39;s followup with the numbers.<br>
<br>
I assume this is for Java5 + optimizations?<br>
What does Java6 show?<br>
<br>
bq. Point 2 does indeed make a difference, we&#39;ve seen it, and it&#39;s<=
br>
<div class=3D"im">only fair: the single pq comparator does this branch opti=
mization but<br>
the current patch multi-pq does not, so let&#39;s level the playing field.<=
br>
<br>
</div>Of course - it&#39;s not about leveling the playing field, but findin=
g the<br>
best solution for the average case - so everything should be optimized<br>
as much as possible. =A0There are probably further optimizations<br>
possible in both the single and multi PQ cases.<br>
<br>
My biggest reservation is that we&#39;ve gone down the road of telling<br>
people to implement a new style of comparators, and told them that the<br>
old style comparators would be deleted in the next release (which is<br>
where we are). =A0Reversing that will be a bit of a headache/question...<br=
>
the new stuff isn&#39;t deprecated, and having *both* isn&#39;t desirable, =
but<br>
that&#39;s a separate decision to be made apart from performance testing.<b=
r>
<br>
Is there also an option of using a multiPQ approach with the new style<br>
comparators?<br>
<div><div></div><div class=3D"h5"><br>
-Yonik<br>
<a href=3D"http://www.lucidimagination.com" target=3D"_blank">http://www.lu=
cidimagination.com</a><br>
<br>
---------------------------------------------------------------------<br>
To unsubscribe, e-mail: <a href=3D"mailto:java-dev-unsubscribe@lucene.apach=
e.org">java-dev-unsubscribe@lucene.apache.org</a><br>
For additional commands, e-mail: <a href=3D"mailto:java-dev-help@lucene.apa=
che.org">java-dev-help@lucene.apache.org</a><br>
<br>
</div></div></blockquote></div><br></div>

--001636b2b5255c5da904769333f7--