lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nuno Seco <ns...@dei.uc.pt>
Subject Re: OutOfMemoryError when using Sort
Date Thu, 12 Nov 2009 19:04:26 GMT
Thanks again

> I am not really sure, why it is enough for you to sort the first 50 highest
> ranking hits, but if you only want to do this, sorting afterwards is quite
> straightforward.
>   
Just to clarify.. And I know it may seem strange...
But I'll mostly be conducting "long"  phrase (3 or 4 word) searches on 
the index.
So I don't expect that many hits to appear given that all I have are 
5-grams.

If I have above 50 hits then problem solved, need not do anything else.
If I have below 50 hits then I want to look at the frequency count of 
the hits I got.

Anyway.... I was just curious about the behavior.
My fault. Thanks for all the ideas.

> Another idea is to not index the count itself, but more use the count as a
> boost factor for each document. The ranking algorithm of lucene will then
> boost docs with higher counts to the top results. But notice: The boost is
> combined with the already available boots (multiplied), so the sorting will
> be a combination of boost and native lucene rank. You can try to overcome
> that by scaling the boost very large (factor 10 or so should be enough for
> 5-grams).
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>   
>> -----Original Message-----
>> From: Nuno Seco [mailto:nseco@dei.uc.pt]
>> Sent: Thursday, November 12, 2009 6:08 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: OutOfMemoryError when using Sort
>>
>> Ok. Thanks.
>>
>> The doc. says:
>> "Finds the top |n| hits for |query|, applying |filter| if non-null, and
>> sorting the hits by the criteria in |sort|."
>>
>> I understood that only the hits (50 in this) for the current search
>> would be sorted...
>> I'll just do the ordering afterwards. Thank you for clarifying this issue.
>>
>>
>> --
>> Nuno Seco
>>
>>
>> Jake Mannix wrote:
>>     
>>> Sorting utilizes a FieldCache: the forward lookup - the value a document
>>>       
>> has
>>     
>>> for a
>>> particular field (as opposed to the usual "inverted" way of looking at
>>>       
>> all
>>     
>>> documents
>>> which contains a given term), which lives in memory, and takes up as
>>>       
>> much
>>     
>>> space
>>> as one 4-bytes * numDocs.
>>>
>>> If you've indexed the entire 5Gram data set on one index, on one
>>>       
>> machine,
>>     
>>> you've
>>> got how many documents?  Billions?  This will take up billions*4bytes of
>>>       
>> RAM
>>     
>>> to
>>> do this sort.
>>>
>>> You need to shard your index (break it up onto multiple machines, do
>>>       
>> your
>>     
>>> sort
>>> distributed, and merge the results) if you want to do this sorting with
>>>       
>> any
>>     
>>> kind
>>> of performance.
>>>
>>>   -jake
>>>
>>> On Thu, Nov 12, 2009 at 7:57 AM, Nuno Seco <nseco@dei.uc.pt> wrote:
>>>
>>>
>>>       
>>>> Hello List.
>>>>
>>>> I'm having a problem when I add a Sort object to my searcher:
>>>>   docs = searcher.search(parser.parse(search), null, 50, sort);
>>>>
>>>> Every time I execute a query I get an OutOfMemoryError exception.
>>>> But if I execute the query without the Sort object it works fine
>>>>
>>>> Let me briefly explain how my index is structured.
>>>> I'm indexing the Google 5Grams
>>>> (
>>>> http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-
>>>>         
>> to-you.html).
>>     
>>>> The index just has two fields:
>>>>   data = new Field("data", tokens[0], Field.Store.YES,
>>>> Field.Index.ANALYZED, Field.TermVector.NO);
>>>>   count = new Field("count", tokens[1], Field.Store.YES,
>>>> Field.Index.NO, Field.TermVector.NO);
>>>>
>>>> the data corresponds to the 5 gram; e.g.: "my business manager informed
>>>>         
>> me"
>>     
>>>> and the count is simply an integer that represents the frequency of the
>>>> ngram.
>>>>
>>>> The index size after optimization is 63G.
>>>>
>>>> If I do not store the data field using:
>>>>   data = new Field("data", tokens[0], Field.Store.NO,
>>>> Field.Index.ANALYZED, Field.TermVector.NO);
>>>> the total size drops to 32G
>>>>
>>>>
>>>> But using either index with the Sort object causes the exception. I'm
>>>> creating the Sort object like:
>>>>   Sort sort = new Sort(new SortField("count", SortField.INT));
>>>>
>>>> Note: That even with out using the Sort object I still need to pump the
>>>> jvm to 2G (-Xmx2048m). But thats ok...
>>>>
>>>>
>>>> So.... Basically what I want is to order those first 50 hits I get
>>>> according to their frequency counts (count field).
>>>>
>>>>
>>>> I'm using:
>>>> java version "1.6.0_16" (64 bit)
>>>> lucene 2.9.1
>>>> linux ext3 FS
>>>> linux kernel 2.6.31-15
>>>>
>>>> Can anybody help me or redirect me in the right direction?
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> Nuno Seco
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>         
>>>       
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>     
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message