Date: Fri, 19 Jun 2009 09:35:12 +0200
Message-ID: <697f8380906190035r340b2bebocb6146200f3555bd@mail.gmail.com>
Subject: Re: Query that sorts a large result set.
From: Ard Schrijvers
To: users@jackrabbit.apache.org

On Fri, Jun 19, 2009 at 9:25 AM, Marcel Reutegger wrote:
> Hi Ard,
>
> I think this discussion rather belongs to the dev list.

Yes, you are right. :-)

Ard

>
> I'll reply there...
>
> regards
>  marcel
>
> On Thu, Jun 18, 2009 at 23:20, Ard Schrijvers wrote:
>> Hello Marcel,
>>
>> Although I like this solution, it seems to me to be suitable only for
>> dates, right? How do we know that we are sorting on a date: by checking
>> whether it has length 9, or that it starts with msq? Furthermore, I am
>> quite curious how you implemented the approach below. If you just used
>> substrings, we could gain quite a bit more with the following, but I am
>> not sure whether you already do this.
>>
>> Suppose
>>
>> String s = "msqyw2shb";
>> String[] parts = new String[3];
>>
>> If you have
>>
>> parts[0] = s.substring(0, 3);
>>
>> we reduce memory usage quite a bit more with
>>
>> parts[0] = new String(s.substring(0, 3));
>>
>> (substring() shares the char array of s, while the copy keeps only
>> the 3 chars it needs.)
>>
>> Also see [1]. But perhaps you are already doing this.
>>
>> A small improvement we could make directly is replacing:
>>
>> retArray[termDocs.doc()] = term.text().substring(prefix.length());
>>
>> with
>>
>> retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));
>>
>> It looks a bit strange, but for dates I think the prefix is
>> something like "lastModified" plus a delimiter, say 13 chars. This
>> would bring the char array retained in memory down from 22 to 9
>> chars (for dates).
>>
>> Furthermore, it follows that using short property names saves you
>> memory.
>> This could be avoided in the end if we indexed each property in
>> its own Lucene field, instead of putting them all in _:PROPERTIES and
>> prefixing the value with the property name. That would require quite
>> some rewriting of the indexing, though, I think.
>>
>> [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
>>
>>
>>
>> On Thu, Jun 18, 2009 at 1:25 PM, Marcel Reutegger wrote:
>>> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers wrote:
>>>> If you happen to find the holy grail solution, I suppose you'll let us know
>>>> :-) Also, if you have some memory usage numbers with and without
>>>> my suggestion of reducing the precision of your Date field, that
>>>> would be very valuable.
>>>
>>> hmm, I've been thinking about a solution that I would call
>>> flyweight-substring-collation-key. it assumes that there is usually a
>>> major overlap of substrings of the values to sort on, e.g. a
>>> lastModified value. so instead of always keeping the entire value we'd
>>> have a collation key that references multiple reusable substrings.
>>>
>>> assume we have the following values:
>>>
>>> - msqyw2shb
>>> - msqyw2t93
>>> - msqyw2u0v
>>> - msqyw2usn
>>> - msqyw2vkf
>>> - msqyw2wc7
>>> - msqyw2x3z
>>> - msqyw2xvr
>>> - msqyw2ynj
>>> - msqyw2zfb
>>>
>>> (those are date property values, each 1 second after the previous one)
>>>
>>> we could create collation keys for use as comparables in the field
>>> cache like this:
>>>
>>> substring cache:
>>> [0] msq
>>> [1] shb
>>> [2] t93
>>> [3] u0v
>>> [4] usn
>>> [5] vkf
>>> [6] wc7
>>> [7] x3z
>>> [8] xvr
>>> [9] ynj
>>> [10] yw2
>>> [11] zfb
>>>
>>> and then the actual comparables that reference the substrings in the cache:
>>>
>>> - {0, 10, 1}
>>> - {0, 10, 2}
>>> - {0, 10, 3}
>>> - {0, 10, 4}
>>> - {0, 10, 5}
>>> - {0, 10, 6}
>>> - {0, 10, 7}
>>> - {0, 10, 8}
>>> - {0, 10, 9}
>>> - {0, 10, 11}
>>>
>>> this will result in lower memory consumption, and using the reference
>>> indexes could even speed up the comparison.
>>>
>>> a quick test with 1 million date values showed that the memory
>>> consumption drops to 50% with this approach.
>>>
>>> regards
>>>  marcel
>>>
>>
>
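
For reference, the substring cache and the index triples quoted above can be reproduced with a few lines of Java. This is only a rough sketch of the idea, not the code Marcel tested and not anything from Jackrabbit: the class name FlyweightKeys, the fixed chunk width of 3 chars, and the split() and compare() helpers are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of a flyweight substring collation key. */
final class FlyweightKeys {

    /** Chunk width; 3 matches the example values and is an assumption. */
    static final int CHUNK = 3;

    /** Sorted, de-duplicated substring cache shared by all keys. */
    final String[] cache;

    /** One key per value; every int is an index into the cache. */
    final int[][] keys;

    FlyweightKeys(List<String> values) {
        // Collect the distinct chunks; TreeSet keeps them in String order.
        TreeSet<String> distinct = new TreeSet<>();
        for (String value : values) {
            distinct.addAll(split(value));
        }
        cache = distinct.toArray(new String[0]);

        // Replace each chunk by its position in the sorted cache.
        keys = new int[values.size()][];
        for (int i = 0; i < keys.length; i++) {
            List<String> chunks = split(values.get(i));
            int[] key = new int[chunks.size()];
            for (int j = 0; j < key.length; j++) {
                key[j] = Arrays.binarySearch(cache, chunks.get(j));
            }
            keys[i] = key;
        }
    }

    /** Cuts a value into chunks of CHUNK chars; the last one may be shorter. */
    static List<String> split(String value) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < value.length(); i += CHUNK) {
            int end = Math.min(value.length(), i + CHUNK);
            // Explicit copy so a chunk never pins the parent's char array.
            chunks.add(new String(value.substring(i, end)));
        }
        return chunks;
    }

    /** Compares two keys by cache index, which mirrors String order. */
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Because the cache is sorted, comparing two keys index by index gives the same order as comparing the original strings, so no string comparison is needed at sort time. The explicit new String(...) copy matters on JVMs where substring() shares the parent's char array; on JVMs where substring() already copies, it is redundant but harmless.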