lucene-dev mailing list archives

From TomS <tomshall...@googlemail.com>
Subject Re: lucene 2.9 sorting algorithm
Date Tue, 20 Oct 2009 22:07:16 GMT
Hi,

I can confirm the problems mentioned below from trying to migrate to 2.9.

Our Lucene-based (2.4) app uses custom multi-level sorting on a lot of
different fields and pretty large indexes (> 100M docs). Most of the fields
we sort on are strings, some up to 400 characters long. A lot of the strings
share a common prefix, so comparing them character by character is costly.
As the comparison proved to be a hotspot (and ordinary string field caches
a memory hog), we have been using an approach similar to the one mentioned
by John Wang for years now, starting with Lucene 1.x: a map from doc id to
ordinals (int -> int), where the ordinal represents the order imposed by
the (costly) comparison.
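To make the idea concrete, here is a minimal, self-contained sketch of that ordinal trick (the class and method names are hypothetical, not from the app described above): the distinct field values are sorted once up front, each doc is assigned the rank of its value, and sorting then needs only cheap int comparisons instead of repeated string comparisons over long shared prefixes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the docid->ordinal approach described above.
public class OrdinalSortSketch {

    // Build one ordinal per doc: the rank of the doc's field value
    // in the sorted set of distinct values.
    public static int[] buildOrdinals(String[] valueByDoc) {
        String[] sorted = valueByDoc.clone();
        Arrays.sort(sorted);                      // the one costly string sort
        Map<String, Integer> ordOf = new HashMap<>();
        int ord = 0;
        for (String v : sorted) {
            if (!ordOf.containsKey(v)) {
                ordOf.put(v, ord++);
            }
        }
        int[] ords = new int[valueByDoc.length];
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            ords[doc] = ordOf.get(valueByDoc[doc]);
        }
        return ords;
    }

    public static void main(String[] args) {
        String[] values = {"prefix/banana", "prefix/apple", "prefix/cherry"};
        int[] ords = buildOrdinals(values);
        // Sorting doc ids by ordinal now matches sorting by string value,
        // without ever re-comparing the shared "prefix/" characters.
        Integer[] docs = {0, 1, 2};
        Arrays.sort(docs, (a, b) -> Integer.compare(ords[a], ords[b]));
        System.out.println(Arrays.toString(docs)); // [1, 0, 2]
    }
}
```

The memory saving follows from the same structure: the cache holds one int per doc rather than one String per doc.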

This has been rather simple to implement and gives our app sizeable
performance gains. Furthermore, it saves a lot of memory compared to
standard field caches.

With the new API, though, the simple mapping from doc id to pre-generated
sort ordinals no longer works. This is the only reason that has prevented
us from upgrading to 2.9. Currently it seems to us that adapting to the new
sorting API would be either much more complicated to implement or, when
reverting to standard field-cache-based sorting, slower and far more
memory-hungry. On the other hand, we may not yet fully understand 2.9's
inner workings, being used to (and hopefully not stuck with) our
aforementioned custom solution.
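The core mismatch, as I understand it, is that 2.9 searches segment by segment, so comparators see per-segment doc ids while our ordinal array is indexed by global doc id. A hypothetical sketch (names are mine, not Lucene's) of bridging the two via the segment's doc base:

```java
// Hypothetical illustration of the segment-level issue: a precomputed
// global docid->ordinal array must be indexed through the docBase of
// the segment currently being searched.
public class SegmentOrdinalSketch {
    private final int[] globalOrds; // one ordinal per global doc id
    private int docBase;            // updated when search enters a new segment

    public SegmentOrdinalSketch(int[] globalOrds) {
        this.globalOrds = globalOrds;
    }

    // Analogous to being told the next segment's starting global doc id.
    public void setNextSegment(int docBase) {
        this.docBase = docBase;
    }

    // Look up the sort ordinal for a segment-local doc id.
    public int ordinal(int segmentDoc) {
        return globalOrds[docBase + segmentDoc];
    }

    public static void main(String[] args) {
        int[] ords = {3, 0, 2, 1};         // global ordinals for 4 docs
        SegmentOrdinalSketch s = new SegmentOrdinalSketch(ords);
        s.setNextSegment(2);               // second segment starts at global doc 2
        System.out.println(s.ordinal(1));  // local doc 1 -> global doc 3 -> 1
    }
}
```

This only works as long as a global array can be kept in sync with the reader, which is exactly what makes the adaptation non-trivial when segments change independently.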

Other than that, 2.9 seems to be a very nice release. Thanks for the great
work!

-Tom


On Tue, Oct 20, 2009 at 8:55 AM, John Wang <john.wang@gmail.com> wrote:
[...]
Let me provide an example:

    We have a multi-valued field of integers, and we define a sort on this
set of strings with a comparator on each value similar to lexicographic
order, except that we compare on strings instead of on characters. We also
want to keep the multi-valued representation, since we do filtering and
facet counting on it. The in-memory representation is similar to the
UnInvertedField in Solr.
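A minimal sketch of that comparison, under my reading of the description (the class and method names are hypothetical): each doc carries a sorted array of value ordinals, and two docs compare like lexicographic order over those arrays, value by value rather than character by character.

```java
import java.util.Arrays;

// Hypothetical sketch of a multi-valued ordinal sort: docs compare by
// lexicographic order over their sorted arrays of value ordinals.
public class MultiValueOrdinalSort {

    public static int compareDocs(int[] ordsA, int[] ordsB) {
        int n = Math.min(ordsA.length, ordsB.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(ordsA[i], ordsB[i]);
            if (cmp != 0) {
                return cmp;
            }
        }
        // All shared positions equal: the shorter value set sorts first,
        // as in lexicographic order over strings.
        return Integer.compare(ordsA.length, ordsB.length);
    }

    public static void main(String[] args) {
        // Per-doc sorted ordinal arrays for three docs.
        int[][] docOrds = { {1, 4}, {1, 2, 9}, {0} };
        Integer[] docs = {0, 1, 2};
        Arrays.sort(docs, (a, b) -> compareDocs(docOrds[a], docOrds[b]));
        System.out.println(Arrays.toString(docs)); // [2, 1, 0]
    }
}
```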

   Implementing a sort with the old API was rather simple, as we only needed
a mapping from a doc id to a set of ordinals. With the new API, we need to
do a "conversion", which means mapping a set of Strings/ordinals back to a
doc. That, to me, is not trivial, let alone the performance implications.

   That actually gave us the motivation to see whether the old API could
handle the segment-level changes that were made in 2.9 (which, in my
opinion, is the best thing in Lucene since payloads :) )

   So after some investigation, with code and big-O analysis, and
discussions with Mike and Yonik, we feel on our end that, given the
performance numbers, it is unnecessary to go with the more complicated API.

Thanks

-John
