lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Sorting with little memory: A suggestion
Date Fri, 19 Mar 2010 17:26:11 GMT
If you build the ords per-segment, how do you compare results across segments?

Ie, in the non-Collator case, Lucene stores ords but must also store
the actual String so that the FieldComparator is able to compare
results across segments

Mike

On Fri, Mar 19, 2010 at 10:06 AM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> I've been tinkering with the SegmentReader and tried to make a sorter
> that keeps track of ordinals for terms instead of the terms themselves.
> I call it Exposed, as it exposes inner details.
>
> A detailed description of the idea can be found at
> http://sbdevel.wordpress.com/2010/03/19/string-sorting-in-lucene-without-the-memory-penalty/
> but the important points are below.
>
>
> The idea is simple enough: Make an array of ordinals for the sort-terms,
> sort them with a Collator that performs a lookup of the actual Strings.
> Using that array of ordinals, another array going from docID to index in
> the sorted ordinals array is created.
>
> When a sort is performed, a comparison of the entries in the
> docID->sorted ordinals array is enough to determine the order.
>
> The memory savings compared to the traditional approach of holding the
> terms in memory are substantial. When the structure has been build, it
> takes ~5MB to hold sorting information for 1M docs containing 1M unique
> terms in the sort field, ~55MB for 10M (using the PackedInts from
> LUCENE-1990).
>
> Temporary memory use for building is around 2 * #docs * 4 bytes for
> merge sort + cruft. In theory. Also in theory, this could be closer to
> #docs * 4 bytes. In reality I haven't had the time to switch Integers to
> ints in the sorter, so it's quite a bit more.
>
> As all sorting is done upfront, the startup time is non-trivial.
> For ~10M sort terms it takes 6 minutes on my i7-machine using a
> conventional harddisk. After that, sorted searches are a factor 10
> faster than sorting by String-compare for ~1M hits (top 20 displayed).
>
>
> To kick things off, I've created a Lucene-fork at
> http://github.com/tokee/lucene/ with the tag lucene_3_0_exposed.
> It contains a proof of concept that can be used on an optimized Lucene
> index. It can be tested by running
> java-cp /home/te/projects/lucene/build/lucene-core-3.1-dev.jar
> org.apache.lucene.index.ExposedPOC.
>
>
> To me, the trade-offs seems to be
>
> new Sort(new SortField(field, locale))
>  * Fast startup
>  * High memory consumption
>  * Slow sorted searches (Collator.compare)
>  * Instantaneous String delivery
>
> Exposed
>  * Slow startup (this is per-segment, so re-open should work well)
>  * Low temporary and very low permanent memory consumption
>  * Fast sorted searches (integer compare)
>  * Fast String delivery (Term-lookup by ordinal)
>
>
> As memory-problems with sorting is a recurring topic on the user-list,
> having an easy alternative to buying more RAM would be nice. I would
> like to hear if Exposed sounds like a feasible idea to the more seasoned
> Lucene developers.
>
> Regards,
> Toke Eskildsen
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message