Date: Wed, 17 Jun 2009 23:33:33 +0200
Message-ID: <697f8380906171433o22d8e8fbi3992b3fcac26f2ad@mail.gmail.com>
Subject: Re: Query that sorts a large result set.
From: Ard Schrijvers
To: users@jackrabbit.apache.org

Hello,

>> Now, when you want to sort in lucene, first, *all* the lastModified lucene
>> terms are read in memory (suppose 26 chars ~ 100 bytes and 9 chars ~ 80
>> bytes memory)
>>
>> 10.000.000 * 100 bytes = 1 Gb of memory in lucene terms + the jackrabbit
>> SharedFieldCache will occupy another 10.000.000 * 80 bytes (+ overhead for
>> nodes not having a date, which might be 90% * 4 bytes a piece)
>
> This is what I was worried about,

You do have to realize, though, that in my computations I assume you are
using a Calendar value that is unique for each document. If you are fine
with Calendar values of only, say, minute granularity, you will
significantly reduce the number of unique Lucene terms for the calendar
values, which decreases memory usage a lot. With hour granularity (perhaps
only in the property you use for sorting), 10 years results in at most
87600 terms, which is not much anymore, and you may not have the issue at
all.

Also note that the SharedFieldCache memory consumption was reduced a lot
some time ago, see: https://issues.apache.org/jira/browse/JCR-1931

>> Anyways, conclusion, if you have 10.000.000 nodes with lastModified,
>> sorting on it will cost you directly 1.8 Gb, which cannot be freed by a
>> GC, but will be lost during the rest of jvm life (until indexes merge,
>> but this is rare cornercase for big indices).
>
> This makes me a bit more worried, since I thought that at least the memory
> would be GC'd at the end of the request,
> So presumably if the user asks for the first 100 hits sorted by
> lastmodified, subject, status then; will each of those distinct searches
> consume additional memory not freed at the end of the request ?

This is Lucene internals: it is what happens as long as you do not re-open
the index reader.

> There are 2 problems here for us, the UX people are demanding sorting of
> every column that is displayed, and we are using Sling which has a Search
> servlet that accepts XPath or SQL, so I can craft a query that will generate
> OOM for the JVM even if the UI is not causing the problem, we may have to
> remove that servlet, if my fears are real.

Sorting on every column with millions of rows simply isn't possible at the
moment. This is not a Jackrabbit limitation but plain Lucene: it is really
hard to sort on the title of millions of documents. So you might consider
adding an extra column, such as a short_title, to sort on. The worst case
is that sorting only works up to x characters, but at least it works
pretty well, which is better than an instant OOM.

Anyway, I hope to do some tests and investigations in the near future
regarding searching millions of documents. Hopefully I will be able to
come up with better solutions, though it remains a hard issue. I hope this
already helps a bit.

Regards,
Ard
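
A minimal sketch of the two workarounds mentioned above (coarser date granularity and a shortened column to sort on), in plain Java. The class and method names (SortKeys, truncateToHour, maxHourTerms, shortTitle) are made up for illustration and are not Jackrabbit or Lucene API; the assumption is that the application writes the resulting values into separate sort-only properties itself.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

final class SortKeys {

    /** Returns a copy of the calendar with minutes, seconds and millis set to zero. */
    static Calendar truncateToHour(Calendar value) {
        Calendar copy = (Calendar) value.clone();
        copy.set(Calendar.MINUTE, 0);
        copy.set(Calendar.SECOND, 0);
        copy.set(Calendar.MILLISECOND, 0);
        return copy;
    }

    /** Upper bound on distinct hour-granularity values, using 365-day years. */
    static long maxHourTerms(int years) {
        return years * 365L * 24L;
    }

    /** Returns at most maxChars leading characters of the title, for a short sort column. */
    static String shortTitle(String title, int maxChars) {
        return title.length() <= maxChars ? title : title.substring(0, maxChars);
    }

    public static void main(String[] args) {
        Calendar lastModified = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        lastModified.setTimeInMillis(1245274413294L); // 17 Jun 2009 21:33:33 UTC

        // Value to store in a hypothetical sort-only date property.
        System.out.println(truncateToHour(lastModified).getTimeInMillis());

        // At most this many unique terms for 10 years of hourly values.
        System.out.println(maxHourTerms(10)); // 87600

        // Value to store in a hypothetical short_title property.
        System.out.println(shortTitle("Query that sorts a large result set", 10));
    }
}
```

The original full-precision property can stay as it is for display and range queries; only the property used in the order-by clause needs the coarser value. Cutting the title at a fixed length means documents that share the same first x characters sort in an arbitrary order relative to each other.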