lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nuno Seco <ns...@dei.uc.pt>
Subject Re: OutOfMemoryError when using Sort
Date Thu, 12 Nov 2009 17:08:16 GMT
Ok. Thanks.

The doc. says:
"Finds the top |n| hits for |query|, applying |filter| if non-null, and 
sorting the hits by the criteria in |sort|."

I understood that only the hits (50 in this) for the current search 
would be sorted...
I'll just do the ordering afterwards. Thank you for clarifying this issue.


--
Nuno Seco


Jake Mannix wrote:
> Sorting utilizes a FieldCache: the forward lookup - the value a document has
> for a
> particular field (as opposed to the usual "inverted" way of looking at all
> documents
> which contains a given term), which lives in memory, and takes up as much
> space
> as one 4-bytes * numDocs.
>
> If you've indexed the entire 5Gram data set on one index, on one machine,
> you've
> got how many documents?  Billions?  This will take up billions*4bytes of RAM
> to
> do this sort.
>
> You need to shard your index (break it up onto multiple machines, do your
> sort
> distributed, and merge the results) if you want to do this sorting with any
> kind
> of performance.
>
>   -jake
>
> On Thu, Nov 12, 2009 at 7:57 AM, Nuno Seco <nseco@dei.uc.pt> wrote:
>
>   
>> Hello List.
>>
>> I'm having a problem when I add a Sort object to my searcher:
>>   docs = searcher.search(parser.parse(search), null, 50, sort);
>>
>> Every time I execute a query I get an OutOfMemoryError exception.
>> But if I execute the query without the Sort object it works fine
>>
>> Let me briefly explain how my index is structured.
>> I'm indexing the Google 5Grams
>> (
>> http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html).
>>
>> The index just has two fields:
>>   data = new Field("data", tokens[0], Field.Store.YES,
>> Field.Index.ANALYZED, Field.TermVector.NO);
>>   count = new Field("count", tokens[1], Field.Store.YES,
>> Field.Index.NO, Field.TermVector.NO);
>>
>> the data corresponds to the 5 gram; e.g.: "my business manager informed me"
>> and the count is simply an integer that represents the frequency of the
>> ngram.
>>
>> The index size after optimization is 63G.
>>
>> If I do not store the data field using:
>>   data = new Field("data", tokens[0], Field.Store.NO,
>> Field.Index.ANALYZED, Field.TermVector.NO);
>> the total size drops to 32G
>>
>>
>> But using either index with the Sort object causes the exception. I'm
>> creating the Sort object like:
>>   Sort sort = new Sort(new SortField("count", SortField.INT));
>>
>> Note: That even with out using the Sort object I still need to pump the
>> jvm to 2G (-Xmx2048m). But thats ok...
>>
>>
>> So.... Basically what I want is to order those first 50 hits I get
>> according to their frequency counts (count field).
>>
>>
>> I'm using:
>> java version "1.6.0_16" (64 bit)
>> lucene 2.9.1
>> linux ext3 FS
>> linux kernel 2.6.31-15
>>
>> Can anybody help me or redirect me in the right direction?
>>
>> Thanks
>>
>> --
>> Nuno Seco
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message