lucene-java-user mailing list archives

From Toke Eskildsen
Subject Re: Sort runs out of memory
Date Mon, 21 May 2012 07:54:38 GMT
On Thu, 2012-05-17 at 23:03 +0200, Robert Bart wrote:
> I am running Lucene 3.6 in a system that indexes about 4 billion documents
> across several indexes, and I'm hoping to get documents in order of a
> certain NumericField.

What is the maximum size on any single index, in terms of number of
documents? What is the type of the NumericField?

> I've tried using Lucene's Sort implementation, but it looks like it tries
> to do the entire sort in memory by allocating a huge array with space for
> every document in the index.

The FieldCache allocates an array with one entry per document in the
index, of the same type as your NumericField. The sort itself is of the
sliding-window type, meaning that its memory use grows only linearly
with the number of documents requested in the response. Do you require
millions of documents to be returned as part of a search?
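
To illustrate the sliding-window idea, here is a minimal sketch in plain
Java. It is not Lucene's actual collector code: the class TopKSketch and
the method topK are hypothetical names, and the int[] simply stands in
for the per-document array held by the FieldCache. The point is that the
heap never holds more than k + 1 entries, however long the array is.

```java
import java.util.PriorityQueue;

final class TopKSketch {
    /** Returns the doc IDs of the k smallest values, ordered by ascending value. */
    static int[] topK(int[] values, int k) {
        // Max-heap on value, so the worst hit kept so far sits on top.
        // On equal values the larger doc ID counts as worse.
        PriorityQueue<Integer> heap = new PriorityQueue<>(k + 1, (a, b) -> {
            int c = Integer.compare(values[b], values[a]);
            return c != 0 ? c : Integer.compare(b, a);
        });
        for (int doc = 0; doc < values.length; doc++) {
            heap.add(doc);
            if (heap.size() > k) {
                heap.poll(); // evict the current worst hit
            }
        }
        // Polling yields worst-first, so fill the result from the back.
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll();
        }
        return result;
    }

    public static void main(String[] args) {
        int[] values = {50, 10, 40, 20, 30}; // one sort value per document
        for (int doc : topK(values, 3)) {
            System.out.println("doc " + doc + " value " + values[doc]);
        }
    }
}
```

The window costs O(k) memory; the values array is the part that scales
with the size of the index.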

Sanity check: You do specify the type when performing a sorted search,
right? If not, the values will be treated as Strings.

>  On my index, this quickly runs out of memory.

Assuming that your largest index is 1B documents and that your
NumericField is of type Integer, the FieldCache's values for the sort
should take up 1B * 4 bytes = 4GB. Are you hoping for less?
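
The same arithmetic for other value widths, as a small sketch. The class
FieldCacheEstimate and the method bytesNeeded are hypothetical names, and
the figures cover only the raw value array, not JVM or other overhead.

```java
final class FieldCacheEstimate {
    /** Raw bytes for one value per document, given the width of one value in bytes. */
    static long bytesNeeded(long documents, int bytesPerValue) {
        return documents * bytesPerValue;
    }

    public static void main(String[] args) {
        long docs = 1_000_000_000L; // the assumed 1B-document index
        String[] names = {"byte", "short", "int", "long"};
        int[] widths = {1, 2, 4, 8};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-5s %,d bytes%n", names[i], bytesNeeded(docs, widths[i]));
        }
    }
}
```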

> Are there any alternatives or better ways of getting documents in order of
> a NumericField for a very large index?

Be sure to select the type of NumericField to be as small as possible.
If you have few unique sort values (e.g. 17, 80, 2000 and 5678), you
might map them down (to 0, 1, 2 and 3 for this example) and store them
as a byte.
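
A sketch of such a mapping in plain Java, assuming at most 256 unique
values. OrdinalMapper and its methods are hypothetical names, not Lucene
API. Because the dictionary is sorted, comparing the unsigned ordinals
gives the same order as comparing the original values.

```java
import java.util.Arrays;

final class OrdinalMapper {
    /** Sorted array of the distinct values; the index of a value is its ordinal. */
    static int[] dictionary(int[] values) {
        return Arrays.stream(values).distinct().sorted().toArray();
    }

    /** Replaces each value with its ordinal, stored as one byte per document. */
    static byte[] encode(int[] values, int[] dict) {
        if (dict.length > 256) {
            throw new IllegalArgumentException("more than 256 unique values");
        }
        byte[] ords = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            ords[i] = (byte) Arrays.binarySearch(dict, values[i]);
        }
        return ords;
    }

    /** Looks up the original value; the mask reads the byte as unsigned (0..255). */
    static int decode(byte ord, int[] dict) {
        return dict[ord & 0xFF];
    }

    public static void main(String[] args) {
        int[] values = {2000, 17, 5678, 80, 17}; // one sort value per document
        int[] dict = dictionary(values);
        byte[] ords = encode(values, dict);
        System.out.println(Arrays.toString(dict) + " " + Arrays.toString(ords));
    }
}
```

The byte ordinals would be indexed in place of the raw values, with the
dictionary kept on the side to translate back for display.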

Currently Lucene only supports primitive types for numerics in the
FieldCache, so the smallest one is byte (8 bits/document). It is
possible to use only ceil(log2(#unique_values)) bits/document, although
that requires a bit of custom coding.
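
A rough sketch of what that custom coding could look like: ordinals
packed back to back into a long[]. PackedOrdinals is a hypothetical
class written for illustration, not part of the FieldCache, and hooking
it into sorting would need further work (for example a custom
comparator).

```java
final class PackedOrdinals {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedOrdinals(int size, int uniqueValues) {
        // ceil(log2(uniqueValues)), with a minimum of one bit per value.
        this.bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(uniqueValues - 1));
        this.mask = (1L << bitsPerValue) - 1;
        this.blocks = new long[(int) (((long) size * bitsPerValue + 63) / 64)];
    }

    int bitsPerValue() {
        return bitsPerValue;
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((value & mask) << shift);
        int spill = shift + bitsPerValue - 64; // bits that overflow into the next long
        if (spill > 0) {
            int kept = bitsPerValue - spill;   // bits that fit in the first long
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> kept)) | ((value & mask) >>> kept);
        }
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return value & mask;
    }

    public static void main(String[] args) {
        PackedOrdinals packed = new PackedOrdinals(100, 5); // 5 unique values
        packed.set(21, 4); // this entry straddles two longs
        System.out.println(packed.bitsPerValue() + " bits, value " + packed.get(21));
    }
}
```

With four unique values this needs 2 bits/document, so 1B documents
would take about 250MB instead of 1GB as bytes.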

Toke Eskildsen
