lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Toke Eskildsen ...@statsbiblioteket.dk>
Subject RE: Sorting with little memory: A suggestion
Date Fri, 19 Mar 2010 17:46:06 GMT
From: Robert Muir [rcmuir@gmail.com]:
> Toke, only partially-on-topic here, is it possible to describe your
> use-case a little more where its preferable to use this Locale-based
> sort instead of indexing collation keys (e.g. you have to support so
> many locales this would be too much indexing overhead?)

My original use case was to avoid the memory overhead: Looking at our current index, we have
~7.5M documents with ~7M unique titles. They take up about 362MB as UTF-8 bytes, which translates
to a neat 1GB of RAM as Java Strings. That's 1GB less heap for other stuff for us, plus a
sort is fairly slow. Indexing collation keys only helps with the speed problem. Using Exposed,
the permanent overhead would be 40MB with a temporary overhead of 60-80MB for building and
with far greater sort speed.

However, it is not set in stone that we will shift to using Exposed or similar: As many others
we're pursuing real-time indexing and while Exposed sits at the segment-level and thus works
well for re-open, big segment changes still occur from time to time. It is hard to see a way
around this, unless the initial build time can be optimized further (a co-worker gave a great
suggestion today and I think I know another way to shave 30-40% of). Still, for Exposed to
be the default choice, we'd need to get down to a few seconds for million-document segments.

It would be possible to generate the Exposed structure at indexing time, but since it works
on the whole term array (okay, the part to sort) of the segment, a small incremental update
would require rebuilding the whole structure for the changes segments. In reality it would
just move the same penalty from open to index.

> For reference, I tried to give some various tricks for using the
> Collation package here (admittedly for Solr, but the same principles
> apply): http://wiki.apache.org/solr/UnicodeCollation

I can see why this will give faster sorts over just indexing the Strings at the cost of memory
footprint.

> Separately, if there is some reason you really need to use the
> Locale-based runtime sort, perhaps we should rethink ways to allow the
> ICU collator to be used instead of the JDK one, somehow.

I see no reason not to support the ICU collator. As long as it can compare two Strings, I
am not at all religious about which comparator to use. Currently the seatch API works both
ways, providing methods both for simple Locale and for custom sorting, so I guess it would
be the same for Exposed.

Regards,
Toke Eskildsen
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message