lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: OutOfMemoryError when using Sort
Date Thu, 12 Nov 2009 17:21:58 GMT
It is only sorting the top 50 hits, yes, but do do that, it needs to look at
the
*value* of the field for each and every of the billions of documents.  You
can
do this without using memory if you're willing to deal with disk seeks, but
doing billions of those are going to mean that this query most likely only
returns a few weeks from now (kidding, but only a little).

On Thu, Nov 12, 2009 at 9:08 AM, Nuno Seco <nseco@dei.uc.pt> wrote:

> Ok. Thanks.
>
> The doc. says:
> "Finds the top |n| hits for |query|, applying |filter| if non-null, and
> sorting the hits by the criteria in |sort|."
>
> I understood that only the hits (50 in this) for the current search would
> be sorted...
> I'll just do the ordering afterwards. Thank you for clarifying this issue.
>
>
> --
> Nuno Seco
>
>
>
> Jake Mannix wrote:
>
>> Sorting utilizes a FieldCache: the forward lookup - the value a document
>> has
>> for a
>> particular field (as opposed to the usual "inverted" way of looking at all
>> documents
>> which contains a given term), which lives in memory, and takes up as much
>> space
>> as one 4-bytes * numDocs.
>>
>> If you've indexed the entire 5Gram data set on one index, on one machine,
>> you've
>> got how many documents?  Billions?  This will take up billions*4bytes of
>> RAM
>> to
>> do this sort.
>>
>> You need to shard your index (break it up onto multiple machines, do your
>> sort
>> distributed, and merge the results) if you want to do this sorting with
>> any
>> kind
>> of performance.
>>
>>  -jake
>>
>> On Thu, Nov 12, 2009 at 7:57 AM, Nuno Seco <nseco@dei.uc.pt> wrote:
>>
>>
>>
>>> Hello List.
>>>
>>> I'm having a problem when I add a Sort object to my searcher:
>>>  docs = searcher.search(parser.parse(search), null, 50, sort);
>>>
>>> Every time I execute a query I get an OutOfMemoryError exception.
>>> But if I execute the query without the Sort object it works fine
>>>
>>> Let me briefly explain how my index is structured.
>>> I'm indexing the Google 5Grams
>>> (
>>>
>>> http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
>>> ).
>>>
>>> The index just has two fields:
>>>  data = new Field("data", tokens[0], Field.Store.YES,
>>> Field.Index.ANALYZED, Field.TermVector.NO);
>>>  count = new Field("count", tokens[1], Field.Store.YES,
>>> Field.Index.NO, Field.TermVector.NO);
>>>
>>> the data corresponds to the 5 gram; e.g.: "my business manager informed
>>> me"
>>> and the count is simply an integer that represents the frequency of the
>>> ngram.
>>>
>>> The index size after optimization is 63G.
>>>
>>> If I do not store the data field using:
>>>  data = new Field("data", tokens[0], Field.Store.NO,
>>> Field.Index.ANALYZED, Field.TermVector.NO);
>>> the total size drops to 32G
>>>
>>>
>>> But using either index with the Sort object causes the exception. I'm
>>> creating the Sort object like:
>>>  Sort sort = new Sort(new SortField("count", SortField.INT));
>>>
>>> Note: That even with out using the Sort object I still need to pump the
>>> jvm to 2G (-Xmx2048m). But thats ok...
>>>
>>>
>>> So.... Basically what I want is to order those first 50 hits I get
>>> according to their frequency counts (count field).
>>>
>>>
>>> I'm using:
>>> java version "1.6.0_16" (64 bit)
>>> lucene 2.9.1
>>> linux ext3 FS
>>> linux kernel 2.6.31-15
>>>
>>> Can anybody help me or redirect me in the right direction?
>>>
>>> Thanks
>>>
>>> --
>>> Nuno Seco
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message