jackrabbit-dev mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Query that sorts a large result set.
Date Fri, 19 Jun 2009 09:36:24 GMT
On Fri, Jun 19, 2009 at 10:07 AM, Marcel Reutegger <marcel.reutegger@gmx.net> wrote:

> On Thu, Jun 18, 2009 at 23:20, Ard Schrijvers <a.schrijvers@onehippo.com>
> wrote:
> > Much as I like this solution, it seems to me to be suitable only for
> > dates, right?
>
> yeah, it probably works best with fixed length values.


And for similar strings. So for dates, a common thing to sort on, you have
achieved a 50% memory reduction, which is really nice (I think the other 50%
is retained by Lucene).


> see also the wiki page I created about this:
> http://wiki.apache.org/jackrabbit/ReduceMemOfSharedFieldCache
>
> > How do we know that we are sorting on a date...by checking
> > whether it has length 9..or that it starts with msq?
>
> as of IndexFormatVersion V3 (jackrabbit 1.5) the property type is
> stored as a payload on the indexed term.


Great, I did not know. I haven't had time to play with payloads yet. Do they
retain memory? Do you happen to know if you can store multiple payloads on a
term? If so, it might be possible to store, say, the short_title as a
payload, and we could choose to order by a short_title (and perhaps only the
ambiguous documents having the same first 6 chars for the short_title doing
the sort on the entire title)...just thinking out loud, not sure if this is
total nonsense, as I did not look at any code.

>
> >
> > retArray[termDocs.doc()] = new String(term.text().substring(prefix.length()));
>
> hmm, you're right. it was actually my intention to reduce memory usage
> by only keeping the significant part of the term. we should fix that.


It is a nasty String gotcha :-). The saving is not huge, though: just a
couple of bytes per cached term.
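
For reference, a minimal sketch of the gotcha in Java. It assumes a JDK of
that era (before 7u6), where substring() shares the parent string's char[];
on later JDKs substring() already copies, so the extra new String(...) is
harmless but no longer needed. The class name, the helper and the delimiter
are made up for illustration and are not Jackrabbit code.

```java
final class TermSuffix {
    // Hypothetical delimiter between property name and value (an assumption).
    static final char DELIMITER = '\uFFFF';

    /** Returns the value part of a term as a String with its own backing storage. */
    static String valueOf(String termText, String prefix) {
        // On older JDKs substring() keeps a reference to the full parent char[];
        // new String(...) copies only the characters that are actually needed.
        return new String(termText.substring(prefix.length()));
    }

    public static void main(String[] args) {
        String prefix = "lastModified" + DELIMITER; // 12 chars + delimiter
        String termText = prefix + "msqyw2shb";     // prefix + 9-char date value
        String value = valueOf(termText, prefix);
        // prints "13 + 9 = 22"
        System.out.println(prefix.length() + " + " + value.length()
                + " = " + termText.length());
    }
}
```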

Regards Ard


>
>
> regards
>  marcel
>
> > It is a bit strange, but for dates I think the prefix is something
> > like "lastModified" plus a delimiter, say 13 chars..this would bring
> > the char array retained in memory down from 22 to 9 chars (for
> > dates).
> >
> > Furthermore, it follows that using short property names saves you
> > memory. This could be avoided in the end if we indexed each property in
> > its own lucene field, instead of putting them all in :_PROPERTIES and
> > prefixing the value with the property name..this though requires quite
> > some rewrite of the indexing, I think.
> >
> > [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
> >
> >
> >
> > On Thu, Jun 18, 2009 at 1:25 PM, Marcel
> > Reutegger<marcel.reutegger@day.com> wrote:
> > > On Thu, Jun 18, 2009 at 09:37, Ard Schrijvers <a.schrijvers@onehippo.com> wrote:
> > >> If you happen to find the holy grail solution, I suppose you'll let us
> > >> know :-) Also, if you have some memory usage numbers with and without
> > >> my suggestion of reducing the precision of your Date field, that would
> > >> be very valuable.
> > >
> > > hmm, I've been thinking about a solution that I would call
> > > flyweight-substring-collation-key. it assumes that there is usually a
> > > major overlap of substrings of the values to sort on, e.g. a
> > > lastModified value. so instead of always keeping the entire value we'd
> > > have a collation key that references multiple reusable substrings.
> > >
> > > assume we have the following values:
> > >
> > > - msqyw2shb
> > > - msqyw2t93
> > > - msqyw2u0v
> > > - msqyw2usn
> > > - msqyw2vkf
> > > - msqyw2wc7
> > > - msqyw2x3z
> > > - msqyw2xvr
> > > - msqyw2ynj
> > > - msqyw2zfb
> > >
> > > (those are date property values each 1 second after the previous one)
> > >
> > > we could create collation keys for use as comparable in the field
> > > cache like this:
> > >
> > > substring cache:
> > > [0] msq
> > > [1] shb
> > > [2] t93
> > > [3] u0v
> > > [4] usn
> > > [5] vkf
> > > [6] wc7
> > > [7] x3z
> > > [8] xvr
> > > [9] ynj
> > > [10] yw2
> > > [11] zfb
> > >
> > > and then the actual comparables that reference the substrings in the
> > > cache:
> > >
> > > - {0, 10, 1}
> > > - {0, 10, 2}
> > > - {0, 10, 3}
> > > - {0, 10, 4}
> > > - {0, 10, 5}
> > > - {0, 10, 6}
> > > - {0, 10, 7}
> > > - {0, 10, 8}
> > > - {0, 10, 9}
> > > - {0, 10, 11}
> > >
> > > this will result in a lower memory consumption and using the reference
> > > indexes could even speed up the comparison.
> > >
> > > a quick test with 1 million date values showed that the memory
> > > consumption drops to 50% with this approach.
> > >
> > > regards
> > >  marcel
> > >
>
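
The flyweight-substring-collation-key idea quoted above could be sketched
roughly as follows. This is only an illustration under stated assumptions,
not the code used in the quick test: the class name FlyweightKeys, the fixed
chunk width of 3 and the int[] key layout are all made up here. It relies on
the substring cache being sorted, so that comparing cache indexes chunk by
chunk gives the same order as comparing the original strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class FlyweightKeys {
    // Assumed chunk width; it splits the 9-char dates from the thread into 3 x 3.
    static final int CHUNK = 3;

    final String[] cache; // sorted, de-duplicated substrings
    final int[][] keys;   // keys[i] holds the cache indexes for values.get(i)

    FlyweightKeys(List<String> values) {
        // First pass: collect every distinct chunk in sorted order.
        TreeSet<String> unique = new TreeSet<String>();
        for (String v : values) {
            unique.addAll(split(v));
        }
        cache = unique.toArray(new String[0]);
        // Second pass: replace each chunk by its position in the sorted cache.
        keys = new int[values.size()][];
        for (int i = 0; i < keys.length; i++) {
            List<String> parts = split(values.get(i));
            keys[i] = new int[parts.size()];
            for (int j = 0; j < keys[i].length; j++) {
                // Every part was added to the cache above, so the search succeeds.
                keys[i][j] = Arrays.binarySearch(cache, parts.get(j));
            }
        }
    }

    /** Cuts a value into chunks of CHUNK chars; the last chunk may be shorter. */
    static List<String> split(String v) {
        List<String> parts = new ArrayList<String>();
        for (int i = 0; i < v.length(); i += CHUNK) {
            parts.add(v.substring(i, Math.min(v.length(), i + CHUNK)));
        }
        return parts;
    }

    /** Compares two keys index by index; a key that is a prefix sorts first. */
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        FlyweightKeys fk = new FlyweightKeys(Arrays.asList(
                "msqyw2shb", "msqyw2t93", "msqyw2u0v", "msqyw2usn", "msqyw2vkf",
                "msqyw2wc7", "msqyw2x3z", "msqyw2xvr", "msqyw2ynj", "msqyw2zfb"));
        System.out.println(Arrays.toString(fk.keys[0])); // prints [0, 10, 1]
        System.out.println(Arrays.toString(fk.keys[9])); // prints [0, 10, 11]
    }
}
```

Note that a separate int[] per value carries its own object overhead, so a
real implementation would probably pack the indexes more tightly (for
example into a single int or long per document) to actually see the saving.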
