jackrabbit-users mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Query that sorts a large result set.
Date Fri, 19 Jun 2009 07:25:01 GMT
Hi Ard,

I think this discussion rather belongs to the dev list.

I'll reply there...

regards
 marcel

On Thu, Jun 18, 2009 at 23:20, Ard Schrijvers<a.schrijvers@onehippo.com> wrote:
> Hello Marcel,
>
> I like this solution, but it seems to me to be suitable only for dates,
> right? How do we know that we are sorting on a date: by checking
> whether the value has length 9, or that it starts with msq? Furthermore, I am
> quite curious how you implemented the approach below. If you just used
> substrings, we could gain quite a bit more, but I am not sure
> whether you already do this:
>
> Suppose
>
> String s = "msqyw2shb";
>
> If you have
>
> parts[0] = s.substring(0, 3);
>
> we reduce memory usage quite a bit more with
>
> parts[0] = new String(s.substring(0, 3));
>
> Also see [1]. But perhaps you are already doing this.
>
> A small improvement we could make right away is replacing:
>
> retArray[termDocs.doc()] = term.text().substring(prefix.length());
>
> with
>
> retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));
>
> It is a bit strange, but for dates I think the prefix is
> something like "lastModified" plus a delimiter, say 13 chars. Copying
> would bring the char array retained in memory down from 22 chars to
> 9 (for dates).
>
> Furthermore, it follows that using short property names saves you
> memory. This could be avoided in the end if we indexed each property in
> its own Lucene field, instead of putting them all in :_PROPERTIES and
> prefixing the value with the property name. This would require quite
> some rewriting of the indexing, I think.
>
> [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
>
>
>
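
The copy suggested above can be sketched in Java roughly as follows. On JVMs before Java 7 update 6, String.substring() returned a view sharing the parent's char array, so a 9-char date value kept the whole 22-char term text alive; new String(...) forces a trimmed copy. Later JVMs already copy in substring(), so the extra constructor call is redundant there. The class name, the method names, and the \uFFFF delimiter are assumptions made for illustration, not Jackrabbit code.

```java
// Hypothetical helper, not part of Jackrabbit or Lucene.
final class CompactTerms {

    // Strips the "propertyName + delimiter" prefix from a term text and
    // forces a fresh, trimmed copy of the remaining value.
    static String compactValue(String termText, int prefixLength) {
        return new String(termText.substring(prefixLength));
    }

    // Applies compactValue to every term text, e.g. when filling a
    // per-document sort array.
    static String[] compactAll(String[] termTexts, int prefixLength) {
        String[] result = new String[termTexts.length];
        for (int i = 0; i < termTexts.length; i++) {
            result[i] = compactValue(termTexts[i], prefixLength);
        }
        return result;
    }
}
```

With a 13-char prefix ("lastModified" plus one assumed delimiter char), compactValue("lastModified\uFFFFmsqyw2shb", 13) yields the 9-char value "msqyw2shb".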
> On Thu, Jun 18, 2009 at 1:25 PM, Marcel
> Reutegger<marcel.reutegger@day.com> wrote:
>> On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <a.schrijvers@onehippo.com> wrote:
>>> If you happen to find the holy grail solution, I suppose you'll let us know
>>> :-) Also if you would have some memory usage numbers with and without the
>>> suggestion of mine regarding reducing the precision of your Date field, this
>>> would be very valuable.
>>
>> hmm, I've been thinking about a solution that I would call
>> flyweight-substring-collation-key. it assumes that there is usually a
>> major overlap of substrings of the values to sort on, e.g. a
>> lastModified value. so instead of always keeping the entire value we'd
>> have a collation key that references multiple reusable substrings.
>>
>> assume we have the following values:
>>
>> - msqyw2shb
>> - msqyw2t93
>> - msqyw2u0v
>> - msqyw2usn
>> - msqyw2vkf
>> - msqyw2wc7
>> - msqyw2x3z
>> - msqyw2xvr
>> - msqyw2ynj
>> - msqyw2zfb
>>
>> (those are date property values each 1 second after the previous one)
>>
>> we could create collation keys for use as comparable in the field
>> cache like this:
>>
>> substring cache:
>> [0] msq
>> [1] shb
>> [2] t93
>> [3] u0v
>> [4] usn
>> [5] vkf
>> [6] wc7
>> [7] x3z
>> [8] xvr
>> [9] ynj
>> [10] yw2
>> [11] zfb
>>
>> and then the actual comparables that reference the substrings in the cache:
>>
>> - {0, 10, 1}
>> - {0, 10, 2}
>> - {0, 10, 3}
>> - {0, 10, 4}
>> - {0, 10, 5}
>> - {0, 10, 6}
>> - {0, 10, 7}
>> - {0, 10, 8}
>> - {0, 10, 9}
>> - {0, 10, 11}
>>
>> this will result in a lower memory consumption and using the reference
>> indexes could even speed up the comparison.
>>
>> a quick test with 1 million date values showed that the memory
>> consumption drops to 50% with this approach.
>>
>> regards
>>  marcel
>>
>
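
The flyweight-substring-collation-key idea quoted above could look roughly like the sketch below. It is not the implementation that was tested: the fixed 3-char chunk size, the two-pass build, and all class and method names are assumptions. Every value is cut into chunks, the distinct chunks are stored once in a sorted cache, and each value becomes an int[] of cache indexes. Because the cache is sorted, comparing indexes position by position gives the same order as comparing the original strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch, not Jackrabbit code.
final class SubstringKeys {
    static final int CHUNK = 3;   // assumed chunk size

    final String[] cache;         // distinct chunks in sorted order
    final int[][] keys;           // one key per input value

    SubstringKeys(List<String> values) {
        // Pass 1: collect the distinct chunks, sorted lexicographically.
        TreeSet<String> distinct = new TreeSet<>();
        for (String v : values) {
            distinct.addAll(chunks(v));
        }
        cache = distinct.toArray(new String[0]);
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < cache.length; i++) {
            index.put(cache[i], i);
        }
        // Pass 2: replace every chunk by its position in the cache.
        keys = new int[values.size()][];
        for (int i = 0; i < keys.length; i++) {
            List<String> parts = chunks(values.get(i));
            int[] key = new int[parts.size()];
            for (int j = 0; j < key.length; j++) {
                key[j] = index.get(parts.get(j));
            }
            keys[i] = key;
        }
    }

    // Cuts a value into chunks of CHUNK chars (the last may be shorter).
    static List<String> chunks(String v) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < v.length(); i += CHUNK) {
            parts.add(v.substring(i, Math.min(v.length(), i + CHUNK)));
        }
        return parts;
    }

    // Orders two keys by comparing cache indexes instead of characters.
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

For the ten sample values above this produces the same 12-entry cache and keys such as {0, 10, 1}. Whether it saves memory in practice depends on how much the chunks overlap across values.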
