lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Sorting with little memory: A suggestion
Date Fri, 19 Mar 2010 17:26:11 GMT
If you build the ords per-segment, how do you compare results across segments?

Ie, in the non-Collator case, Lucene stores ords but must also store
the actual String so that the FieldComparator is able to compare
results across segments


On Fri, Mar 19, 2010 at 10:06 AM, Toke Eskildsen <> wrote:
> I've been tinkering with the SegmentReader and tried to make a sorter
> that keeps track of ordinals for terms instead of the terms themselves.
> I call it Exposed, as it exposes inner details.
> A detailed description of the idea can be found at
> but the important points are below.
> The idea is simple enough: Make an array of ordinals for the sort-terms,
> sort them with a Collator that performs a lookup of the actual Strings.
> Using that array of ordinals, another array going from docID to index in
> the sorted ordinals array is created.
> When a sort is performed, a comparison of the entries in the
> docID->sorted ordinals array is enough to determine the order.
> The memory savings compared to the traditional approach of holding the
> terms in memory are substantial. When the structure has been build, it
> takes ~5MB to hold sorting information for 1M docs containing 1M unique
> terms in the sort field, ~55MB for 10M (using the PackedInts from
> LUCENE-1990).
> Temporary memory use for building is around 2 * #docs * 4 bytes for
> merge sort + cruft. In theory. Also in theory, this could be closer to
> #docs * 4 bytes. In reality I haven't had the time to switch Integers to
> ints in the sorter, so it's quite a bit more.
> As all sorting is done upfront, the startup time is non-trivial.
> For ~10M sort terms it takes 6 minutes on my i7-machine using a
> conventional harddisk. After that, sorted searches are a factor 10
> faster than sorting by String-compare for ~1M hits (top 20 displayed).
> To kick things off, I've created a Lucene-fork at
> with the tag lucene_3_0_exposed.
> It contains a proof of concept that can be used on an optimized Lucene
> index. It can be tested by running
> java-cp /home/te/projects/lucene/build/lucene-core-3.1-dev.jar
> org.apache.lucene.index.ExposedPOC.
> To me, the trade-offs seems to be
> new Sort(new SortField(field, locale))
>  * Fast startup
>  * High memory consumption
>  * Slow sorted searches (
>  * Instantaneous String delivery
> Exposed
>  * Slow startup (this is per-segment, so re-open should work well)
>  * Low temporary and very low permanent memory consumption
>  * Fast sorted searches (integer compare)
>  * Fast String delivery (Term-lookup by ordinal)
> As memory-problems with sorting is a recurring topic on the user-list,
> having an easy alternative to buying more RAM would be nice. I would
> like to hear if Exposed sounds like a feasible idea to the more seasoned
> Lucene developers.
> Regards,
> Toke Eskildsen
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message