Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 21053 invoked from network); 23 Oct 2009 05:08:52 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 23 Oct 2009 05:08:52 -0000 Received: (qmail 1488 invoked by uid 500); 23 Oct 2009 05:08:52 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 1422 invoked by uid 500); 23 Oct 2009 05:08:51 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 1414 invoked by uid 99); 23 Oct 2009 05:08:51 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 23 Oct 2009 05:08:51 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of john.wang@gmail.com designates 209.85.217.219 as permitted sender) Received: from [209.85.217.219] (HELO mail-gx0-f219.google.com) (209.85.217.219) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 23 Oct 2009 05:08:41 +0000 Received: by gxk19 with SMTP id 19so5170910gxk.5 for ; Thu, 22 Oct 2009 22:08:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=H0srNo5ntDScpLm7Zklcz3/AmkVQVdi+iI8+KP15Wmw=; b=xjLTc6mN/hUr50xn4HIBHWaJhYRs2wJKi/HVnWWPEmDKmT80Db7DXVZaMs7AtU5Zum q9FEA6jwkzmFvJWk+XHMywe3CaDUA2K5nbhNbCbNLk+YG8VOILHgTd0jjKg7cy7xnVpl kDb+H3wP5A2/EHup2bugdtR6qpuhsRDjlST3s= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=qx+13R6JY14kl5e4iVxZn0yGUkkjH7oIaj9x4TT+ydO4LJ2yz1T6T5R+LeV/AK/raI km9Zjpa8oy38d8LVRJjhghz4uJPhgRZ68ceq14SfljAGcsQmzc4ZJ11hLLWbWQSLdZNj XBhOTvrDU7SsG0XbwcEeIrKKe2bf27BBZA9Lo= MIME-Version: 1.0 Received: by 10.100.236.28 with SMTP id j28mr6651011anh.37.1256274500803; Thu, 22 Oct 2009 22:08:20 -0700 (PDT) In-Reply-To: References: <8837fb770910142012w5f11ba57y99aeb18493603813@mail.gmail.com> <9ac0c6aa0910210311nffa18c4gd3acbf0e73ef7b58@mail.gmail.com> <8837fb770910212317g5a8c09ecid1810f301087583c@mail.gmail.com> <9ac0c6aa0910220238l5cdce87g314ea18cd85bd695@mail.gmail.com> <8837fb770910221837p287db022i6ed18349977381c6@mail.gmail.com> <4AE114FA.7050309@gmail.com> <8837fb770910221935t6e744eedi2cc129f5df17d508@mail.gmail.com> <4b124c310910222011x10d03a7v608a170340880c75@mail.gmail.com> Date: Thu, 22 Oct 2009 22:08:20 -0700 Message-ID: <8837fb770910222208y50704ae9kfe225e1a5a9d287f@mail.gmail.com> Subject: Re: lucene 2.9 sorting algorithm From: John Wang To: java-dev@lucene.apache.org, yonik@lucidimagination.com Content-Type: multipart/alternative; boundary=001636b2b5255c5da904769333f7 X-Virus-Checked: Checked by ClamAV on apache.org --001636b2b5255c5da904769333f7 Content-Type: text/plain; charset=ISO-8859-1 Hi Yonik: I have been head deep in this trying to find out a good solution for better part of the past two days, it's been hard because there are so many variables: 1) how optimized are the code from either of the implementations 2) VM difference 3) HW etc. Also, there are quite a few dimensions this issue is being discussed on: Algorithm: I think we should NOT jump to the conclusion that my number on the multiQ is valid until others reproduce it (which is one of the reason I asked mike to run his benchmark again with 1.5) I am gonna try to run it on server machines when I get back to my office next week. Overall, I think the single Q algorithm is better. (It however does pay a price for some string compares etc.), Its benefit becomes more and more significant when the product of PQ size and segment count increases, which makes complete sense from the algorithm. However, when PQ size is small (which is in most of the cases, the multiplier on the segment count is also small) the benefit is not as obvious. And sometimes the trade-off for the constant string compare cost may not be worth it. (this remains a hypothesis) With Java 1.6, maybe the singleQ approach is a winner in all cases. I will spend more time to find out a more definitive answer. API: The new FieldComparator API is not difficult to understand (especially for Lucene experts such as yourselves), but it is more involved in comparison to the ScoreDocComparator API. I think anyone would agree with that. Furthermore, when implementing some custom comparators, (examples I have given earlier in this thread), it can be difficult to implement while maintaining performance. I understand changing API is hard, that is why I am trying to raise this as soon as possible, and it could very well be that the current API is fine. Lucene's collector api allows anyone to plugin any sorting algorithm, kinda like what Mike has done with the tests. So it is ok if an API selected does not fit the needs for everyone. In conclusion, please understand I am not trying to be "right" on this, just trying to learn and to understand, which I did from reading and trying to understand the code, along with guidance from Mike and Yonik and I am more than impressed with the thoughts and code tuning that went into it. Thanks -John On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley wrote: > On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix > wrote: > > It's hard to read the column format, but if you look up above in the > thread > > from tonight, > > you can see that yes, for PQ sizes less than 100 elements, multiPQ is > > better, and only > > starts to be worse at around 100 for strings, and 50 for ints. > > Ah, OK, I had missed John's followup with the numbers. > > I assume this is for Java5 + optimizations? > What does Java6 show? > > bq. Point 2 does indeed make a difference, we've seen it, and it's > only fair: the single pq comparator does this branch optimization but > the current patch multi-pq does not, so let's level the playing field. > > Of course - it's not about leveling the playing field, but finding the > best solution for the average case - so everything should be optimized > as much as possible. There are probably further optimizations > possible in both the single and multi PQ cases. > > My biggest reservation is that we've gone down the road of telling > people to implement a new style of comparators, and told them that the > old style comparators would be deleted in the next release (which is > where we are). Reversing that will be a bit of a headache/question... > the new stuff isn't deprecated, and having *both* isn't desirable, but > that's a separate decision to be made apart from performance testing. > > Is there also an option of using a multiPQ approach with the new style > comparators? > > -Yonik > http://www.lucidimagination.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org > For additional commands, e-mail: java-dev-help@lucene.apache.org > > --001636b2b5255c5da904769333f7 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi Yonik:

=A0=A0 =A0I have been head deep in this trying= to find out a good solution for better part of the past two days, it's= been hard because there are so many variables:

1)= how optimized are the code from either of the implementations
2) VM difference
3) HW etc.

=A0=A0 = =A0Also, there are quite a few dimensions this issue is being discussed on:=

Algorithm:

=A0=A0 =A0I t= hink we should NOT jump to the conclusion that my number on the multiQ is v= alid until others reproduce it (which is one of the reason I asked mike to = run his benchmark again with 1.5) I am gonna try to run it on server machin= es when I get back to my office next week.

=A0=A0 =A0Overall, I think the single Q algorithm is be= tter. (It however does pay a price for some string compares etc.), Its bene= fit becomes more and more significant when the product of PQ size and segme= nt count increases, which makes complete sense from the algorithm. However,= when PQ size is small (which is in most of the cases, the multiplier on th= e segment count is also small) the benefit is not as obvious. And sometimes= the trade-off for the constant string compare cost may not be worth it. (t= his remains a hypothesis)

=A0=A0 =A0With Java 1.6, maybe the singleQ approach is = a winner in all cases.

=A0=A0 =A0 I will spend mor= e time to find out a more definitive answer.

API:<= /div>

=A0=A0 =A0The new FieldComparator API is not difficult to un= derstand (especially for Lucene experts such as yourselves), but it is more= involved in comparison to the ScoreDocComparator API. I think anyone would= agree with that. Furthermore, when implementing some custom comparators, (= examples I have given earlier in this thread), it can be difficult to imple= ment while maintaining performance. =A0 =A0

=A0=A0 =A0I understand changing API is hard, that is wh= y I am trying to raise this as soon as possible, and it could very well be = that the current API is fine.

=A0=A0 =A0Lucene'= ;s collector api allows anyone to plugin any sorting algorithm, kinda like = what Mike has done with the tests. So it is ok if an API selected does not = fit the needs for everyone.

=A0=A0 =A0In conclusion, please understand I am not try= ing to be "right" on this, just trying to learn and to understand= , which I did from reading and trying to understand the code, along with gu= idance from Mike and Yonik and I am more than impressed with the thoughts a= nd code tuning that went into it.

Thanks

-John

On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley <yonik@lucidi= magination.com> wrote:
On Thu, Oct 22, 2009 at 1= 1:11 PM, Jake Mannix <jake.mann= ix@gmail.com> wrote:
> It's hard to read the column format, but if you look up above in t= he thread
> from tonight,
> you can see that yes, for PQ sizes less than 100 elements, multiPQ is<= br> > better, and only
> starts to be worse at around 100 for strings, and 50 for ints.

Ah, OK, I had missed John's followup with the numbers.

I assume this is for Java5 + optimizations?
What does Java6 show?

bq. Point 2 does indeed make a difference, we've seen it, and it's<= br>
only fair: the single pq comparator does this branch opti= mization but
the current patch multi-pq does not, so let's level the playing field.<= br>
Of course - it's not about leveling the playing field, but findin= g the
best solution for the average case - so everything should be optimized
as much as possible. =A0There are probably further optimizations
possible in both the single and multi PQ cases.

My biggest reservation is that we've gone down the road of telling
people to implement a new style of comparators, and told them that the
old style comparators would be deleted in the next release (which is
where we are). =A0Reversing that will be a bit of a headache/question... the new stuff isn't deprecated, and having *both* isn't desirable, = but
that's a separate decision to be made apart from performance testing.
Is there also an option of using a multiPQ approach with the new style
comparators?

-Yonik
http://www.lu= cidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


--001636b2b5255c5da904769333f7--