Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 40991 invoked from network); 20 Oct 2009 22:07:42 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 20 Oct 2009 22:07:42 -0000 Received: (qmail 17966 invoked by uid 500); 20 Oct 2009 22:07:41 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 17903 invoked by uid 500); 20 Oct 2009 22:07:41 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 17895 invoked by uid 99); 20 Oct 2009 22:07:41 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Oct 2009 22:07:41 +0000 X-ASF-Spam-Status: No, hits=-0.8 required=5.0 tests=AWL,BAYES_00,HTML_MESSAGE X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: 209.85.219.219 is neither permitted nor denied by domain of tomshally17@googlemail.com) Received: from [209.85.219.219] (HELO mail-ew0-f219.google.com) (209.85.219.219) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Oct 2009 22:07:38 +0000 Received: by ewy19 with SMTP id 19so5863115ewy.28 for ; Tue, 20 Oct 2009 15:07:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=domainkey-signature:mime-version:received:date:message-id:subject :from:to:content-type; bh=405USHVawLTgzLZW+XaHZkaxmJqLDVJzOaQBemQunDU=; b=vYyl2c7sh+lflO5Gl38MmwoXCtkoAYxMZII5ifqBOukuWUNZbPoXVxrgy7E9duGQGd YwHdogE7zj5lpfd5H9k4EuB+8QiM9jSuUh+ZsD2SEDFFbWNxc6UYi0DuvnqEpYYlb1pc aqLRyolWb5aLHGcHG5k/iuptAHHZlt2pxuoxE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=googlemail.com; s=gamma; h=mime-version:date:message-id:subject:from:to:content-type; b=CKbyzIGbp2CulFAbuUr2hHdmmBgUT1PrMYFcfg24gCBeIB2up4Qvu+boUIHGGDtbad QZ+14cs7fB1VBZw76x8Wfx3EY+TSooEDP1A3R+vCqkVsjyXzd8r6y7Y+UgldzW/WJOFJ CKFRMnqF47yI0tdZtdLUDO33uAr0R3stvgGzM= MIME-Version: 1.0 Received: by 10.216.86.148 with SMTP id w20mr2675265wee.138.1256076436767; Tue, 20 Oct 2009 15:07:16 -0700 (PDT) Date: Wed, 21 Oct 2009 00:07:16 +0200 Message-ID: Subject: Re: lucene 2.9 sorting algorithm From: TomS To: java-dev@lucene.apache.org Content-Type: multipart/alternative; boundary=0016e6d7ef38d305f9047665159d --0016e6d7ef38d305f9047665159d Content-Type: text/plain; charset=ISO-8859-1 Hi, I can confirm the below mentioned problems trying to migrate to 2.9. Our Lucene-based (2.4) app uses custom multi-level sorting on a lot of different fields and pretty large indexes (> 100m docs). Most of the fields that we sort on are strings, some with up to 400 characters in length. A lot of the strings share a common prefix, so comparing these character-by-character is costly. As the comparison proved to be a hotspot (and ordinary string field caches a memory-hog) we've been using a similar approach like the one mentioned by John Wang for years now, starting with Lucene 1.x: Using a map from doc id to ordinals (int->int) where the ordinal represents the order imposed by the (costly) comparison. This has been rather simple to implement and gives our app sizeable performance gains. Furthermore, it saves a lot of memory compared to standard field caches. With the new API, though, the simple mapping from doc id to pre-generated sort ordinals won't work anymore. This is the only reason that has prevented us from upgrading to 2.9. Currently it seems to us that adapting to the new sorting API would be either much more complicated to implement, or - when reverting to standard field cache-based sorting - slower and use far more memory. On the other hand, we might not have a full understanding of 2.9's inner workings yet, being used to (hopefully not stuck with) our a.m. custom solution. Other than that, 2.9 seems to be a very nice release. Thanks for the great work! -Tom On Tue, Oct 20, 2009 at 8:55 AM, John Wang wrote: [...] Let me provide an example: We have a multi valued field on integers, we define a sort on this set of strings by defining a comparator on each value to be similar to a lex order, instead of compare on characters, we do on strings, we also want to keep the multi value representation as we do filtering and facet counting on it. The in memory representation is similar to the UnInvertedField in Solr. Implementing a sort with the old API was rather simple, as we only needed mapping from a docid to a set of ordinals. With the new api, we needed to do a "conversion", which would mean mapping a set of String/ordinals back to a doc. Which is to me, is not trivial, let alone performance implications. That actually gave us to motivation to see if the old api can handle the segment level changes that was made in 2.9 (which in my opinion is the best thing in lucene since payloads :) ) So after some investigation, with code and big O analysis, and discussions with Mike and Yonik, on our end, we feel given the performance numbers, it is unnecessary to go with the more complicated API. Thanks -John --0016e6d7ef38d305f9047665159d Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi,

I can confirm the below mentioned problems trying to migrate to = 2.9.

Our Lucene-based (2.4) app uses custom multi-level sorting on a= lot of different fields and pretty large indexes (> 100m docs). Most of= the fields that we sort on are strings, some with up to 400 characters in = length. A lot of the strings share a common prefix, so comparing these char= acter-by-character is costly. As the comparison proved to be a hotspot (and= ordinary string field caches a memory-hog) we've been using a similar = approach like the one mentioned by John Wang for years now, starting with L= ucene 1.x: Using a map from doc id to ordinals (int->int) where the ordi= nal represents the order imposed by the (costly) comparison.

This has been rather simple to implement and gives our app sizeable per= formance gains. Furthermore, it saves a lot of memory compared to standard = field caches.

With the new API, though, the simple mapping from doc = id to pre-generated sort ordinals won't work anymore. This is the only = reason that has prevented us from upgrading to 2.9. Currently it seems to u= s that adapting to the new sorting API would be either much more complicate= d to implement, or - when reverting to standard field cache-based sorting -= slower and use far more memory. On the other hand, we might not have a ful= l understanding of 2.9's inner workings yet, being used to (hopefully n= ot stuck with) our a.m. custom solution.

Other than that, 2.9 seems to be a very nice release. Thanks for the gr= eat work!

-Tom


On Tue, Oct 20, 2009 at 8:55 AM, John Wang= <john.wang@gmail.com> wrote:
[...]
Let me provide an example:

=A0=A0 =A0We have a multi valued field on integers, we define a sort on this set of strings by defining a comparator on each value to be similar to a lex order, instead of compare on characters, we do on strings, we also want to keep the multi value representation as we do filtering and facet counting on it. The in memory representation is similar to the UnInvertedField in Solr.

=A0=A0 Implementing a sort with the old API was rather simple, as we only needed mapping from a docid to a set of ordinals. With the new api, we needed to do a "conversion", which would mea= n mapping a set of String/ordinals back to a doc. Which is to me, is not trivial, let alone performance implications.

=A0=A0 That actually gave us to motivation to see if the old api can handle the segment level changes that was made in 2.9 (which in my opinion is the best thing in lucene since payloads :) )
<= div>
=A0=A0 So after some investigation, with code and big O analysis, and discussions with Mike and Yonik, on our end, we feel given the performance numbers, it is unnecessary to go with the more complicated API.

Thanks

-John


--0016e6d7ef38d305f9047665159d--