Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: neutral (athena.apache.org: 209.85.219.219 is neither permitted
 nor denied by domain of tomshally17@googlemail.com)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=googlemail.com; s=gamma;
        h=mime-version:date:message-id:subject:from:to:content-type;
        b=CKbyzIGbp2CulFAbuUr2hHdmmBgUT1PrMYFcfg24gCBeIB2up4Qvu+boUIHGGDtbad
         QZ+14cs7fB1VBZw76x8Wfx3EY+TSooEDP1A3R+vCqkVsjyXzd8r6y7Y+UgldzW/WJOFJ
         CKFRMnqF47yI0tdZtdLUDO33uAr0R3stvgGzM=
MIME-Version: 1.0
Date: Wed, 21 Oct 2009 00:07:16 +0200
Message-ID: <dd9aa9e0910201507x44866bebseda800fc71b2506c@mail.gmail.com>
Subject: Re: lucene 2.9 sorting algorithm
From: TomS <tomshally17@googlemail.com>
To: java-dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=0016e6d7ef38d305f9047665159d

--0016e6d7ef38d305f9047665159d
Content-Type: text/plain; charset=ISO-8859-1

Hi,

I can confirm the below mentioned problems trying to migrate to 2.9.

Our Lucene-based (2.4) app uses custom multi-level sorting on a lot of
different fields and pretty large indexes (> 100m docs). Most of the fields
that we sort on are strings, some with up to 400 characters in length. A lot
of the strings share a common prefix, so comparing these
character-by-character is costly. As the comparison proved to be a hotspot
(and ordinary string field caches a memory-hog) we've been using a similar
approach like the one mentioned by John Wang for years now, starting with
Lucene 1.x: Using a map from doc id to ordinals (int->int) where the ordinal
represents the order imposed by the (costly) comparison.

This has been rather simple to implement and gives our app sizeable
performance gains. Furthermore, it saves a lot of memory compared to
standard field caches.

With the new API, though, the simple mapping from doc id to pre-generated
sort ordinals won't work anymore. This is the only reason that has prevented
us from upgrading to 2.9. Currently it seems to us that adapting to the new
sorting API would be either much more complicated to implement, or - when
reverting to standard field cache-based sorting - slower and use far more
memory. On the other hand, we might not have a full understanding of 2.9's
inner workings yet, being used to (hopefully not stuck with) our a.m. custom
solution.

Other than that, 2.9 seems to be a very nice release. Thanks for the great
work!

-Tom


On Tue, Oct 20, 2009 at 8:55 AM, John Wang <john.wang@gmail.com> wrote:
[...]
Let me provide an example:

    We have a multi valued field on integers, we define a sort on this set
of strings by defining a comparator on each value to be similar to a lex
order, instead of compare on characters, we do on strings, we also want to
keep the multi value representation as we do filtering and facet counting on
it. The in memory representation is similar to the UnInvertedField in Solr.

   Implementing a sort with the old API was rather simple, as we only needed
mapping from a docid to a set of ordinals. With the new api, we needed to do
a "conversion", which would mean mapping a set of String/ordinals back to a
doc. Which is to me, is not trivial, let alone performance implications.

   That actually gave us to motivation to see if the old api can handle the
segment level changes that was made in 2.9 (which in my opinion is the best
thing in lucene since payloads :) )

   So after some investigation, with code and big O analysis, and
discussions with Mike and Yonik, on our end, we feel given the performance
numbers, it is unnecessary to go with the more complicated API.

Thanks

-John

--0016e6d7ef38d305f9047665159d
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi,<br><br>I can confirm the below mentioned problems trying to migrate to =
2.9.<br><br>Our Lucene-based (2.4) app uses custom multi-level sorting on a=
 lot of different fields and pretty large indexes (&gt; 100m docs). Most of=
 the fields that we sort on are strings, some with up to 400 characters in =
length. A lot of the strings share a common prefix, so comparing these char=
acter-by-character is costly. As the comparison proved to be a hotspot (and=
 ordinary string field caches a memory-hog) we&#39;ve been using a similar =
approach like the one mentioned by John Wang for years now, starting with L=
ucene 1.x: Using a map from doc id to ordinals (int-&gt;int) where the ordi=
nal represents the order imposed by the (costly) comparison.<br>
<br>This has been rather simple to implement and gives our app sizeable per=
formance gains. Furthermore, it saves a lot of memory compared to standard =
field caches.<br><br>With the new API, though, the simple mapping from doc =
id to pre-generated sort ordinals won&#39;t work anymore. This is the only =
reason that has prevented us from upgrading to 2.9. Currently it seems to u=
s that adapting to the new sorting API would be either much more complicate=
d to implement, or - when reverting to standard field cache-based sorting -=
 slower and use far more memory. On the other hand, we might not have a ful=
l understanding of 2.9&#39;s inner workings yet, being used to (hopefully n=
ot stuck with) our a.m. custom solution.<br>
<br>Other than that, 2.9 seems to be a very nice release. Thanks for the gr=
eat work!<br><br>-Tom<br><br><br>On Tue, Oct 20, 2009 at 8:55 AM, John Wang=
 <span dir=3D"ltr">&lt;<a href=3D"mailto:john.wang@gmail.com" target=3D"_bl=
ank">john.wang@gmail.com</a>&gt;</span> wrote:<br>

[...]<br><div>Let me provide an example:</div>

<div><br></div><div>=A0=A0 =A0We have a multi valued field on integers, we
define a sort on this set of strings by defining a comparator on each
value to be similar to a lex order, instead of compare on characters,
we do on strings, we also want to keep the multi value representation
as we do filtering and facet counting on it. The in memory
representation is similar to the UnInvertedField in Solr.</div>

<div><br></div><div>=A0=A0 Implementing a sort with the old API was rather
simple, as we only needed mapping from a docid to a set of ordinals.
With the new api, we needed to do a &quot;conversion&quot;, which would mea=
n
mapping a set of String/ordinals back to a doc. Which is to me, is not
trivial, let alone performance implications.</div>

<div><br></div><div>=A0=A0 That actually gave us to motivation to see if
the old api can handle the segment level changes that was made in 2.9
(which in my opinion is the best thing in lucene since payloads :) )</div><=
div><br>

</div><div>=A0=A0 So after some investigation, with code and big O
analysis, and discussions with Mike and Yonik, on our end, we feel
given the performance numbers, it is unnecessary to go with the more
complicated API.</div>

<div><br></div><div>Thanks</div><div><br></div><font color=3D"#888888"><div=
>-John</div></font><div><br></div><div><br></div>

--0016e6d7ef38d305f9047665159d--