Date: Fri, 19 Jun 2009 09:35:12 +0200
Message-ID: <697f8380906190035r340b2bebocb6146200f3555bd@mail.gmail.com>
Subject: Re: Query that sorts a large result set.
From: Ard Schrijvers <a.schrijvers@onehippo.com>
To: users@jackrabbit.apache.org

On Fri, Jun 19, 2009 at 9:25 AM, Marcel
Reutegger<marcel.reutegger@gmx.net> wrote:
> Hi Ard,
>
> I think this discussion rather belongs to the dev list.

Yes you are right.. :-)

Ard

>
> I'll reply there...
>
> regards
>  marcel
>
> On Thu, Jun 18, 2009 at 23:20, Ard Schrijvers <a.schrijvers@onehippo.com> wrote:
>> Hello Marcel,
>>
>> Although I like this solution, it seems to me to be suitable only for
>> dates, right? How do we know that we are sorting on a date: by checking
>> whether the value has length 9, or that it starts with msq? Furthermore,
>> I am quite curious how you implemented what you describe below. If you
>> just used substrings, we could gain quite a bit more, but I am not sure
>> whether you already do this:
>>
>> Suppose
>>
>> String s = "msqyw2shb";
>>
>> If you have, for some String[] parts,
>>
>> parts[0] = s.substring(0, 3);
>>
>> we reduce memory usage quite a bit more with
>>
>> parts[0] = new String(s.substring(0, 3));
>>
>> Also see [1]. But perhaps you are already doing this.
>>
>> A small improvement we could make right away is replacing:
>>
>> retArray[termDocs.doc()] = term.text().substring(prefix.length());
>>
>> with
>>
>> retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));
>>
>> It looks a bit strange, but for dates I think the prefix is something
>> like "lastModified" plus a delimiter, say 13 chars. This would bring
>> the char array retained in memory down from 22 to 9 chars (for dates).
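
The replacement suggested above can be sketched in isolation as follows. On the JVMs current at the time of this thread, String.substring() returned a String that shared the parent's char array (the behaviour discussed in [1]), so copying with new String(...) kept only the needed chars. Since Java 7 update 6, substring() makes its own copy and the extra wrapper no longer changes what is retained. The class name, the method name and the ':' delimiter are assumptions made for this illustration; they are not taken from the Jackrabbit source.

```java
// Sketch only: StringCompaction and stripPrefix are hypothetical names,
// not part of Jackrabbit or Lucene.
final class StringCompaction {

    // Returns the text after the first prefixLength chars, copied into a
    // fresh String so the result does not have to share storage with termText.
    static String stripPrefix(String termText, int prefixLength) {
        return new String(termText.substring(prefixLength));
    }

    public static void main(String[] args) {
        // Assumed prefix: property name plus one delimiter char, 13 chars in total.
        String prefix = "lastModified" + ':';
        String term = prefix + "msqyw2shb";
        String value = stripPrefix(term, prefix.length());
        System.out.println(term.length() + " chars in term, "
                + value.length() + " chars kept");
    }
}
```

With the assumed 13-char prefix, the term is 22 chars long and the retained value is 9 chars, which matches the numbers given above.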
>>
>> Furthermore, it follows that using short property names saves you
>> memory. This could be avoided in the end if we index each property in
>> its own Lucene field, instead of indexing all of them in _:PROPERTIES
>> and prefixing the value with the property name. That, though, requires
>> quite some rewriting of the indexing, I think.
>>
>> [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
>>
>>
>>
>> On Thu, Jun 18, 2009 at 1:25 PM, Marcel
>> Reutegger<marcel.reutegger@day.com> wrote:
>>> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <a.schrijvers@onehippo.com> wrote:
>>>> If you happen to find the holy grail solution, I suppose you'll let us know
>>>> :-) Also if you would have some memory usage numbers with and without the
>>>> suggestion of mine regarding reducing the precision of your Date field, this
>>>> would be very valuable.
>>>
>>> hmm, I've been thinking about a solution that I would call
>>> flyweight-substring-collation-key. it assumes that there is usually a
>>> major overlap of substrings of the values to sort on, e.g. a
>>> lastModified value. so instead of always keeping the entire value we'd
>>> have a collation key that references multiple reusable substrings.
>>>
>>> assume we have the following values:
>>>
>>> - msqyw2shb
>>> - msqyw2t93
>>> - msqyw2u0v
>>> - msqyw2usn
>>> - msqyw2vkf
>>> - msqyw2wc7
>>> - msqyw2x3z
>>> - msqyw2xvr
>>> - msqyw2ynj
>>> - msqyw2zfb
>>>
>>> (those are date property values each 1 second after the previous one)
>>>
>>> we could create collation keys for use as comparable in the field
>>> cache like this:
>>>
>>> substring cache:
>>> [0] msq
>>> [1] shb
>>> [2] t93
>>> [3] u0v
>>> [4] usn
>>> [5] vkf
>>> [6] wc7
>>> [7] x3z
>>> [8] xvr
>>> [9] ynj
>>> [10] yw2
>>> [11] zfb
>>>
>>> and then the actual comparables that reference the substrings in the cache:
>>>
>>> - {0, 10, 1}
>>> - {0, 10, 2}
>>> - {0, 10, 3}
>>> - {0, 10, 4}
>>> - {0, 10, 5}
>>> - {0, 10, 6}
>>> - {0, 10, 7}
>>> - {0, 10, 8}
>>> - {0, 10, 9}
>>> - {0, 10, 11}
>>>
>>> this will result in a lower memory consumption and using the reference
>>> indexes could even speed up the comparison.
>>>
>>> a quick test with 1 million date values showed that the memory
>>> consumption drops to 50% with this approach.
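
The idea described above could look roughly like the sketch below. It is not Marcel's implementation, which is not shown in this thread. It assumes fixed 3-char chunks and values of equal length, so that comparing chunk indexes gives the same order as comparing the strings. FlyweightKeys, CHUNK, compare and value are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Sketch only: FlyweightKeys is a hypothetical class, not Jackrabbit code.
final class FlyweightKeys {
    static final int CHUNK = 3; // assumed chunk width, as in the example values

    final String[] cache; // distinct chunks in sorted order
    final int[][] keys;   // keys[n][j] = cache index of the j-th chunk of value n

    FlyweightKeys(List<String> values) {
        // First pass: collect every distinct chunk; TreeSet keeps them sorted.
        TreeSet<String> distinct = new TreeSet<>();
        for (String v : values) {
            for (int i = 0; i < v.length(); i += CHUNK) {
                distinct.add(v.substring(i, Math.min(v.length(), i + CHUNK)));
            }
        }
        cache = distinct.toArray(new String[0]);

        // Second pass: replace each chunk by its index in the sorted cache.
        keys = new int[values.size()][];
        for (int n = 0; n < values.size(); n++) {
            String v = values.get(n);
            int[] key = new int[(v.length() + CHUNK - 1) / CHUNK];
            for (int j = 0; j < key.length; j++) {
                int from = j * CHUNK;
                String chunk = v.substring(from, Math.min(v.length(), from + CHUNK));
                key[j] = Arrays.binarySearch(cache, chunk);
            }
            keys[n] = key;
        }
    }

    // Compares two keys index by index; no string comparison is needed
    // because the cache is sorted.
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Rebuilds the original value of entry n from the shared chunks.
    String value(int n) {
        StringBuilder sb = new StringBuilder();
        for (int idx : keys[n]) {
            sb.append(cache[idx]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FlyweightKeys fk = new FlyweightKeys(Arrays.asList(
                "msqyw2shb", "msqyw2t93", "msqyw2u0v", "msqyw2usn", "msqyw2vkf",
                "msqyw2wc7", "msqyw2x3z", "msqyw2xvr", "msqyw2ynj", "msqyw2zfb"));
        System.out.println(Arrays.toString(fk.cache));
        System.out.println(Arrays.toString(fk.keys[0]));
    }
}
```

For the ten example values this builds the same 12-entry substring cache as listed above, and the first key comes out as {0, 10, 1}. How much memory this saves on real data depends on how much the chunks overlap; the sketch does not try to reproduce the 50% figure.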
>>>
>>> regards
>>>  marcel
>>>
>>
>