jackrabbit-users mailing list archives

From Ian Boston <...@tfd.co.uk>
Subject Re: Query that sorts a large result set.
Date Wed, 17 Jun 2009 16:12:38 GMT

On 17 Jun 2009, at 10:04, Ard Schrijvers wrote:

> Hello,
>
> I think there is another issue you might hit before the part Marcel
> describes:
>
> Suppose, as your result set is 1.000.000 nodes, that you have
> 10.000.000 nodes containing a Date.
>
> A date is stored internally in lucene as 9 chars + the prefix of the
> property name, suppose 'lastModified' + some namespace & delimiter
> overhead, say 5 chars. This is 26 chars in total (a smaller name for
> 'lastModified' will save memory in the end (though it has been a while,
> so I might be wrong)).
>
> Now, when you want to sort in lucene, first, *all* the lastModified
> lucene terms are read into memory (suppose 26 chars ~ 100 bytes and
> 9 chars ~ 80 bytes of memory).
>
> 10.000.000 * 100 bytes = 1 GB of memory in lucene terms + the jackrabbit
> SharedFieldCache will occupy another 10.000.000 * 80 bytes (+ overhead
> for nodes not having a date, which might be 90% * 4 bytes apiece)

This is what I was worried about.
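
As a rough sanity check of the figures quoted above, here is a minimal Java sketch of the arithmetic. The class and method names are made up for illustration, and the per-entry byte costs are simply the estimates from the quoted text (about 100 bytes per lucene term, about 80 bytes per SharedFieldCache entry), not measurements. The extra overhead for nodes without the property is left out.

```java
public class SortMemoryEstimate {
    // Assumed per-entry costs, taken from the estimates quoted above.
    static final long BYTES_PER_LUCENE_TERM = 100L; // ~26 chars: property name prefix + date
    static final long BYTES_PER_CACHE_ENTRY = 80L;  // ~9 chars held by the SharedFieldCache

    // Estimated bytes held once a sort on the property has been executed.
    static long estimateBytes(long nodesWithProperty) {
        return nodesWithProperty * (BYTES_PER_LUCENE_TERM + BYTES_PER_CACHE_ENTRY);
    }

    public static void main(String[] args) {
        long nodes = 10_000_000L;
        long total = estimateBytes(nodes);
        System.out.println(nodes + " nodes -> " + total + " bytes");
    }
}
```

With 10.000.000 nodes this comes to 1.800.000.000 bytes. If the same estimate holds per sorted property, sorting on three different columns would roughly triple it, and more for long string properties.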


>
> Anyway, conclusion: if you have 10.000.000 nodes with lastModified,
> sorting on it will directly cost you 1.8 GB, which cannot be freed by
> a GC, but will be lost for the rest of the JVM's life (until indexes
> merge, but this is a rare corner case for big indices).


This makes me a bit more worried, since I thought that at least the
memory would be GC'd at the end of the request.
So presumably, if the user asks for the first 100 hits sorted by
lastModified, then by subject, then by status, will each of those
distinct searches consume additional memory that is not freed at the
end of the request?

There are two problems here for us. The UX people are demanding sorting
on every column that is displayed, and we are using Sling, which has a
search servlet that accepts XPath or SQL. So I can craft a query that
will cause an OOM in the JVM even if the UI is not causing the
problem. We may have to remove that servlet if my fears are real.



>
> Basically, this is imo the first issue with sorting large data sets
> (if you sort on title or a property that contains large strings, memory
> is gone even faster). Also, the doubling (1 GB in lucene and 0.8 GB in
> SharedFieldCache) could be avoided, but needs a large change wrt
> indexing properties.
>
> Regarding the resultFetchSize: typically, when you have an archive,
> displaying all pages is not an option, is it?

Agreed, a UX with 1M items in a list isn't really usable. The max they
want is 100, so there is not much point in fetching the entire set.

>
> I suppose that if I use a setLimit(3) on a query, it lowers the
> resultFetchSize at runtime, doesn't it? This would indeed make it much
> more efficient if you only want the last 10 news items added. Is this
> correct?

I think so, if I follow you correctly.
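
To make the "only the first 100" idea concrete, here is a generic Java sketch of top-N selection with a bounded min-heap. This is not Jackrabbit's or lucene's actual code, and the class and method names are hypothetical. It only shows why the list of hits kept during a limited query can stay small.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNByDate {
    // Returns the n largest timestamps, newest first, keeping at most n in memory.
    static List<Long> newest(long[] lastModified, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap: oldest kept hit on top
        for (long ts : lastModified) {
            if (heap.size() < n) {
                heap.add(ts);
            } else if (n > 0 && ts > heap.peek()) {
                heap.poll(); // drop the oldest of the kept hits
                heap.add(ts);
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        long[] stamps = {5L, 1L, 9L, 3L, 7L};
        System.out.println(newest(stamps, 3));
    }
}
```

Note that the loop still has to look at the sort value of every candidate, so a limit like this bounds the result list but not the memory needed to hold the sort values themselves.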

>
> Regarding [2], I think it would be nice if we could add this. If it
> happens to be really hard, we could perhaps more easily create an
> indexing configuration where we define the precision/granularity of
> the Date property to be indexed. This is easy and gives a major
> performance increase; the only cost is that the precision is lowered
> for searching on dates.
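
The granularity idea above can be illustrated with a small Java sketch. It assumes nothing about how Jackrabbit would implement such an indexing configuration; the class and method names are hypothetical. It just counts how many distinct values remain when timestamps are truncated to a coarser unit before being indexed.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.HashSet;
import java.util.Set;

public class DateGranularity {
    // Counts distinct values after truncating 'count' timestamps, spaced
    // 'stepSeconds' apart, to the given unit (up to ChronoUnit.DAYS).
    static int distinctTerms(Instant start, int count, long stepSeconds, ChronoUnit unit) {
        Set<Instant> terms = new HashSet<>();
        for (int i = 0; i < count; i++) {
            terms.add(start.plusSeconds(i * stepSeconds).truncatedTo(unit));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2009-06-17T00:00:00Z");
        // One node modified every second for a full day.
        for (ChronoUnit unit : new ChronoUnit[] {ChronoUnit.SECONDS, ChronoUnit.HOURS, ChronoUnit.DAYS}) {
            System.out.println(unit + ": " + distinctTerms(start, 86_400, 1L, unit));
        }
    }
}
```

Fewer distinct values means fewer terms for a range query to visit, at the price of coarser ordering and matching on dates.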
>
> Regards Ard
>
>
> On Wed, Jun 17, 2009 at 10:13 AM, Marcel Reutegger <marcel.reutegger@gmx.net> wrote:
>
>> Hi,
>>
>> the sorting is pretty well optimized, it basically uses underlying
>> lucene functionality for that. there are two other important points
>> that will influence performance:
>>
>> 1) workspace configuration
>>
>> the default workspace configuration will cause initial fetching of
>> the entire result set. you can change this behavior by setting the
>> resultFetchSize parameter. See [0].
>>
>> 2) Ian wrote: "I only want to see a small number of items eg 100
>> after a particular date."
>>
>> that might actually become a problem. it will result in a range query
>> that potentially selects lots (millions?) of nodes with distinct date
>> properties. this case is not optimized. there's a new indexing
>> technique in lucene called trierange queries [1] which was
>> specifically built to perform such queries efficiently. but this is
>> not yet integrated with jackrabbit.
>>
>> I've created a JIRA issue to discuss and keep track of such an
>> enhancement in jackrabbit: [2]
>>
>> regards
>> marcel
>>
>> [0] http://issues.apache.org/jira/browse/JCR-651
>> [1] http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
>> [2] https://issues.apache.org/jira/browse/JCR-2151
>>
>> On Wed, Jun 17, 2009 at 01:50, Ian Boston<ieb@tfd.co.uk> wrote:
>>> Hi,
>>>
>>> I want to perform a query where the full result set could be
>>> millions of items. That set needs to be sorted by the lastModified
>>> attribute on the node, and I only want to see a small number of
>>> items eg 100 after a particular date.
>>>
>>> If I do this, will there be scalability issues, or is the sorting
>>> of a date field optimized in the query engine ?
>>>
>>> Thanks
>>> Ian
>>>
>>

