lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Memory requirements for filters (was Re: Out of memory - CachingWrapperFilter and multiple threads)
Date Tue, 19 Feb 2008 22:06:16 GMT
Eks,

On Tuesday 19 February 2008 21:48:03, eks dev wrote:
...
>
> >Btw. there is some room in SortedVIntList to add interval
> >coding. Normally the VInt value 0 cannot occur in the current
> >version, and this could be used as a prefix to encode a run of
> >set bits.
> >
> > I like this! I was just experimenting with
>
> int[] leftIntervalExtreme
> int[] intervalLength
> representation of interval lists. This has one nice feature: you can
> binary search the left intervals for a really fast long skipTo(), but
> it has somewhat higher memory consumption in case the bit vector is
> badly distributed... SortedVIntList with RLEncoding could prove more robust
> in that sense.
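
A minimal sketch of the two-array interval representation quoted above, to make the binary-search skipTo() concrete. The class name IntervalDocIdList, the NO_MORE_DOCS sentinel and the constructor are assumptions for illustration only; this is not an existing Lucene class, and it assumes the intervals are sorted and do not overlap.

```java
import java.util.Arrays;

/** Hypothetical illustration; not part of Lucene. */
final class IntervalDocIdList {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // assumed sentinel

    private final int[] leftIntervalExtreme; // first doc id of each run, ascending
    private final int[] intervalLength;      // number of consecutive doc ids in each run

    IntervalDocIdList(int[] leftIntervalExtreme, int[] intervalLength) {
        this.leftIntervalExtreme = leftIntervalExtreme;
        this.intervalLength = intervalLength;
    }

    /** Returns the smallest set doc id >= target, or NO_MORE_DOCS. */
    int skipTo(int target) {
        int pos = Arrays.binarySearch(leftIntervalExtreme, target);
        if (pos >= 0) {
            return target; // target is the first doc of an interval
        }
        int next = -pos - 1; // index of the first interval starting after target
        int prev = next - 1; // index of the last interval starting before target
        if (prev >= 0 && target < leftIntervalExtreme[prev] + intervalLength[prev]) {
            return target;   // target falls inside the previous interval
        }
        return next < leftIntervalExtreme.length
                ? leftIntervalExtreme[next]
                : NO_MORE_DOCS;
    }
}
```

With intervals {3,4}, {10..14} and {20}, skipTo(5) lands on 10 after one binary search over three entries. The cost is two ints per interval, which is where the higher memory use comes from when the set bits form many short runs.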

skipTo() on a SortedVIntList as it stands is not nice: it's a linear
search. I'd like to add skip info to it, much like the multilevel skip
info that was added to the index not too long ago. With that
addition, skipTo() on a SortedVIntList should be ok, too.
At that point, it might also be possible to split the underlying
byte[] into maximum size blocks, as Robert suggested.
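
As a rough illustration of what skip info could look like, here is a single-level sketch over a VInt-coded delta list. Everything in it is an assumption made for the example: the class name SkippableVIntList, the SKIP_INTERVAL of 4 and the stateless skipTo() are hypothetical, and this is neither the current SortedVIntList code nor the multilevel scheme used in the index.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration; not Lucene's SortedVIntList. */
final class SkippableVIntList {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // assumed sentinel
    private static final int SKIP_INTERVAL = 4;        // assumed: one skip entry per 4 docs

    private final byte[] bytes;     // VInt-coded deltas between successive doc ids
    private final int[] skipDoc;    // doc id decoded just before each skip point
    private final int[] skipOffset; // byte offset at which decoding resumes

    /** sortedDocs must be strictly increasing and non-negative. */
    SkippableVIntList(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        skipDoc = new int[sortedDocs.length / SKIP_INTERVAL];
        skipOffset = new int[skipDoc.length];
        int prev = 0;
        for (int i = 0; i < sortedDocs.length; i++) {
            int delta = sortedDocs[i] - prev;
            while ((delta & ~0x7F) != 0) { // 7 bits per byte, high bit means "more follows"
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            prev = sortedDocs[i];
            if ((i + 1) % SKIP_INTERVAL == 0) { // record a skip entry after every 4th doc
                skipDoc[i / SKIP_INTERVAL] = prev;
                skipOffset[i / SKIP_INTERVAL] = out.size();
            }
        }
        bytes = out.toByteArray();
    }

    /** Returns the smallest doc id >= target, or NO_MORE_DOCS. */
    int skipTo(int target) {
        int doc = 0;
        int pos = 0;
        // Jump to the last skip point whose doc id is still below target.
        for (int k = 0; k < skipDoc.length && skipDoc[k] < target; k++) {
            doc = skipDoc[k];
            pos = skipOffset[k];
        }
        // Decode the remaining deltas linearly from that point.
        while (pos < bytes.length) {
            int b = bytes[pos++];
            int delta = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
            }
            doc += delta;
            if (doc >= target) {
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }
}
```

The linear part of the search is bounded by SKIP_INTERVAL deltas here. The skip table itself is scanned linearly in this sketch; a binary search or further skip levels over it would be the next step, and the recorded offsets are also natural places to split the byte[] into blocks.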

> Friend of mine sent me this link, looks very interesting:
> http://repositories.cdlib.org/cgi/viewcontent.cgi?article=3104&context=lbnl

First impression:
Nice article, good for relational dbs, and for bitwise boolean ops.
In Lucene there is normally the need to score each matching doc
though, and for that the doc number is needed, and that does not
really fit in the data structures discussed in the article.

Regards,
Paul Elschot


> On Tuesday 19 February 2008 12:58:34, eks dev wrote:
> > hi Mark,
> >
> > just out of curiosity, do you know the distribution of set bits in
> > these terms you have tried to cache? maybe this simple tip could
> > help.
> > If you are lucky like we were, such terms typically used for
> > filters are good candidates to be used to sort your index before
> > indexing (once in a while) and then with some sort of
> > IntervalDocIdSet you can reduce memory requirements dramatically.
