lucene-java-user mailing list archives

From Toke Eskildsen ...@statsbiblioteket.dk>
Subject Re: Memory issues
Date Tue, 06 Sep 2011 06:50:49 GMT
On Sat, 2011-09-03 at 20:09 +0200, Michael Bell wrote:
> To be exact, there are about 300 million documents. This is running on a 64 bit JVM/64
> bit OS with 24 GB(!) RAM allocated.

How much memory is allocated to the JVM?

> Now, their searches are working fine IF you do not SORT the results. If you do SORT,
> you get stuff like
> 
> 2011-08-30 13:01:31,489 [TP-Processor8] ERROR com.gwava.utils.ServerErrorHandlerStrategy
> - reportError: nastybadthing :: com.gwava.indexing.lucene.internal.LuceneSearchController.performSearchOperation:229
> :: EXCEPTION : java.lang.OutOfMemoryError: Requested array size exceeds VM limit java.lang.OutOfMemoryError:
> Requested array size exceeds VM limit
>  at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:624)
[...]

> Looking at the sort class, the api docs appear to say it would create an element of 1.2
> billion items (4*300m).

The StringIndexCache in Lucene 3 keeps two arrays in memory: int[#docs]
and String[#docs+1]. With 300M documents that is 4 bytes * 300M = 1.2
billion bytes for the int-array, which should not be a problem for the
machine.

Unfortunately the String-array is a big problem. Keeping in mind that a
String in Java takes up approximately 50 + 2 * length bytes and setting
the average length of the terms to 10 chars, the Strings take up a
maximum of 300M * (50 + 2 * 10) bytes = 21,000 MByte, or about 20 GByte.

In reality it is not that bad as duplicates only count once, but the
problem should be obvious.
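
The arithmetic above can be written down as a small sketch. The class and
method names are made up for illustration, and the 50-byte String overhead
is only the rough approximation used in this message, not a measured value.
The space for the references inside the String-array itself is left out,
as it is in the formula above.

```java
// Hypothetical helper, not part of Lucene: back-of-envelope estimate of the
// memory needed to sort on a String field in Lucene 3.
public class SortMemoryEstimate {
    // Assumption from the message: a Java String costs about 50 + 2 * length bytes.
    static final long STRING_OVERHEAD_BYTES = 50;
    static final long BYTES_PER_CHAR = 2;
    static final long BYTES_PER_INT = 4;

    // The int[#docs] array: one 4-byte entry per document.
    public static long intArrayBytes(long docs) {
        return docs * BYTES_PER_INT;
    }

    // The String objects: duplicates only count once, so pass the number of
    // unique terms. Passing the document count gives the worst case.
    public static long stringBytes(long uniqueTerms, long avgTermLength) {
        return uniqueTerms * (STRING_OVERHEAD_BYTES + BYTES_PER_CHAR * avgTermLength);
    }

    public static void main(String[] args) {
        long docs = 300000000L;   // 300M documents, as in the question
        long avgTermLength = 10;  // assumed average term length in chars

        System.out.println("int-array bytes: " + intArrayBytes(docs));
        System.out.println("worst-case String bytes: " + stringBytes(docs, avgTermLength));
    }
}
```

With a real count of unique terms for the sort field, replace `docs` in the
call to `stringBytes` to get a tighter estimate.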

> Is this correct? Is the issue going beyond signed int32 limits of an array ( 2 billion
> items) or is it really a memory issue? How best to diagnose?

Open your index with Luke and count the number of unique terms for your
sort field. Using the formula above, you'll get an estimate of the
memory required for sorting on String in Lucene 3.

The int32 limit is only for the number of unique terms and there is a
maximum of one term/document when sorting. With 300M documents there's a
lot of room before that will be a problem.

If your field is numeric, changing the sort type should solve your
problem. If you really are comparing Strings, it is not so easy.

Lucene 4 is unfortunately not ready for production, but it has huge
improvements with regard to memory usage on sorting.

If you are feeling adventurous, you can take a look at
https://issues.apache.org/jira/browse/LUCENE-2369
which drastically reduces the memory needed for sorting. An experiment
with 200M unique terms required 1.7 GByte, with the trade-off that it
took 8 minutes to open the index. One of the earlier patches works
against Lucene 3, while the later ones are Lucene 4 only.

