lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eks dev <>
Subject Re: RLE Compressing bit vectors, just toughts
Date Sat, 04 Aug 2007 18:26:09 GMT
Hi Paul, thanks for feedback

>I suppose you mean run length encoding? (I missed the first posting about 

You are right, that is what I meant by RLE. This was the first posting. I am just trying get
some feedback to see if there are some knock-out conditions disqualifying  this idea.

a bit of background, 
It came up as a side effect after playing with Filter , better said, with your Matcher idea.
We found some *relatively* clean way to get around BitSet - Filter marriage by avoiding Filter
completely and making our own "SlicingIndexReader extends FilterIndexReader" that has ability
to receive  Matcher(filter) from outside. Also,  FilteredTermDocs implements TermDocs inside
of it to use this "Matcher" to do skipping.   We could not wait  for  your nice  code that
solves Filter problems  to get on svn. I did not bother anybody with this code as I think
it solves problem with Filter on conceptually soft ground. It provides me with capacity to
make different view on index subset defined by Matcher, but does not have flexibility your
approach with Matcher has (eg BooleanQuery.add(Matcher, Occur)...). Also, it is fast, as it
filters out at "the source", with Scorers totally unaware of it.    .... irrelevant, just
mentioning it as maybe someone finds this idea
 useful for something else. 

Back to the RLE, by doing all this, we came up with one SortedIntIntervalListMatcher  as 
we have some fields in index that compress perfectly using this trick! IntervalList is practically
 the same thing as RLE, solves  the same problem. So the idea, "can RLE save some ticks/ index
size without affecting performance in typical non-sorted case", I would say yes, but it is
good idea to ask for feedback from people more familiar with multi interval skip lists and
bit level gurus like you and Yonik,   honestly, I have no idea what would be relative cost
of an extra if(0xFF==b) in tight next() and skipTo() methods, as  this determines how big
slow down in the worst case will be (totally sparse, no RLE "firing"). 

>You could try and use a VInt flag value (how about 0xFF ?) to start an encoded 
>run length encoded series of bits. For example 0xFF would be followed by the
>next delta as a VInt, and by the run length as the next VInt.

>You might also try and generalize the bytes of VInt to nibbles (half bytes).

I will see, I need to figure out some experiments that are not as involved as implementing
it on Lucene index (change index format and whatnot..)
Anyhow,  it is encouraging  not to have  immediate "sorry, cannot work because ...."   from
 you or someone else on this list :)

thanks again for feedback,

Yahoo! Answers - Got a question? Someone out there knows the answer. Try it

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message