jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Query that sorts a large result set.
Date Wed, 17 Jun 2009 21:33:33 GMT

>> Now, when you want to sort in lucene, first, *all* the lastModified
>> terms are read in memory (suppose 26 chars ~ 100 bytes and 9 chars ~ 80
>> bytes memory)
>> 10.000.000 * 100 bytes = 1 Gb of memory in lucene terms + the jackrabbit
>> SharedFieldCache will occupy another 10.000.000 * 80 bytes (+ overhead for
>> nodes not having a date, which might be 90% * 4 bytes a piece)
> This is what I was worried about,

You have to realize, though, that in my computations I assumed you are using
a Calendar value that is unique for each document. If you find it acceptable
to stick to Calendar values with only, say, minute granularity, you will
significantly reduce the number of unique Lucene terms for the calendar
values, and with that the memory usage; perhaps the issue disappears
entirely. If you use hour granularity (perhaps only in the property you sort
on), then 10 years results in at most 87,600 terms, which is not much
anymore.
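To illustrate the granularity idea, here is a minimal sketch using plain java.util.Calendar (not a Jackrabbit API): truncating the value before storing it makes many documents share the same lastModified term, and hour granularity over 10 years bounds the term count at 10 * 365 * 24 = 87,600.

```java
import java.util.Calendar;

public class HourGranularity {

    // Truncate a Calendar to hour granularity so that many documents
    // share the same lastModified term in the Lucene index.
    static Calendar truncateToHour(Calendar original) {
        Calendar c = (Calendar) original.clone();
        c.set(Calendar.MINUTE, 0);
        c.set(Calendar.SECOND, 0);
        c.set(Calendar.MILLISECOND, 0);
        return c;
    }

    public static void main(String[] args) {
        // At hour granularity, 10 years can produce at most
        // 10 * 365 * 24 = 87,600 distinct terms.
        int maxTerms = 10 * 365 * 24;
        System.out.println(maxTerms);
    }
}
```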

Also note that some time ago, the SharedFieldCache memory consumption was
decreased a lot, see:


>> Anyway, conclusion: if you have 10.000.000 nodes with lastModified,
>> sorting on it will directly cost you 1.8 Gb, which cannot be freed by a
>> GC but is lost for the rest of the JVM's life (until indexes merge, but
>> that is a rare corner case for big indices).
> This makes me a bit more worried, since I thought that at least the memory
> would be GC'd at the end of the request. So presumably, if the user asks
> for the first 100 hits sorted by lastmodified, subject, status, will each
> of those distinct searches consume additional memory that is not freed at
> the end of the request?

Yes, this is Lucene internals: the sort caches are tied to the index reader,
so the memory is not released as long as you do not re-open the IndexReader.
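To make the arithmetic behind the quoted estimate explicit, here is a rough back-of-the-envelope computation; the per-term byte counts are the assumptions from this thread, not measured values:

```java
public class SortMemoryEstimate {

    public static void main(String[] args) {
        long docs = 10_000_000L;          // nodes carrying a lastModified property
        long luceneTermBytes = 100;       // assumed bytes per term held by Lucene
        long sharedFieldCacheBytes = 80;  // assumed bytes per entry in the SharedFieldCache

        long luceneTotal = docs * luceneTermBytes;       // 1,000,000,000 bytes ~ 1 Gb
        long cacheTotal = docs * sharedFieldCacheBytes;  // 800,000,000 bytes ~ 0.8 Gb

        // Together roughly the 1.8 Gb quoted above.
        System.out.println(luceneTotal + cacheTotal);
    }
}
```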

> There are 2 problems here for us: the UX people are demanding sorting on
> every column that is displayed, and we are using Sling, which has a Search
> servlet that accepts XPath or SQL, so I can craft a query that will generate
> an OOM for the JVM even if the UI is not causing the problem. We may have to
> remove that servlet if my fears are real.

Sorting on every column with millions of rows simply isn't possible at the
moment. This is not a Jackrabbit limitation but plain Lucene: it is really
hard to sort millions of documents on a title. So, you might consider adding
extra properties, such as a short_title, to sort on. The worst case is that
sorting only works up to x characters, but at least it works pretty well,
and better than an instant OOM.
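One way to apply the short_title idea is to derive a bounded-length sort key when writing the node. A minimal sketch follows; the method name and length limit are illustrative, not a Jackrabbit API:

```java
public class SortKey {

    // Derive a bounded-length, lowercase sort key from a free-form title.
    // Capping the length caps both the number and the size of the unique
    // Lucene terms that sorting has to hold in memory.
    static String shortTitle(String title, int maxChars) {
        String normalized = title.trim().toLowerCase();
        return normalized.length() <= maxChars
                ? normalized
                : normalized.substring(0, maxChars);
    }

    public static void main(String[] args) {
        // Store this value in a separate property and sort on that
        // property instead of the full title.
        System.out.println(shortTitle("Query that sorts a large result set.", 10));
    }
}
```

Sorting on such a key is only accurate up to the first maxChars characters, which is exactly the trade-off described above.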

Anyway, I hope to do some tests and investigation in the near future
regarding searching millions of documents. Hopefully I will be able to come
up with better solutions, though it remains a hard problem. I hope this
helps at least a bit already.

Regards Ard
