lucene-dev mailing list archives

From eks dev <>
Subject Re: OpenBitSet
Date Tue, 16 May 2006 19:04:26 GMT
Yeah, good hint. We actually made such measurements on our TreeIntegerSet implementation, and
the results are totally astonishing (I remember 6 MB against 2 KB of memory consumption
for "predominantly sorted bit vectors" like zip codes, with conjunction/disjunction an order of
magnitude faster, as it walks a shallow tree in that case). If you have any possibility to sort
your indexes, do so; even Lucene's on-disk representation appreciates this, I guess (skips are
faster, bit vectors on disk compress/decompress better?).
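
One rough way to see why sorted input helps is to count the runs of consecutive set bits: the
fewer the runs, the less a run- or tree-based set has to store. The sketch below uses plain
java.util.BitSet and a hypothetical BitRuns helper; it is only an illustration and not the
TreeIntegerSet code.

```java
import java.util.BitSet;

class BitRuns {
    /** Counts maximal runs of consecutive set bits. */
    static int countRuns(BitSet bits) {
        int runs = 0;
        int i = bits.nextSetBit(0);
        while (i >= 0) {
            runs++;
            // Jump to the first clear bit after this run, then to the next run.
            int end = bits.nextClearBit(i);
            i = bits.nextSetBit(end);
        }
        return runs;
    }
}
```

Comparing countRuns on the same filter built from a sorted and an unsorted index gives a quick
feel for how much clustering there is to exploit.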
We even made a small visualizer of bit vectors that generates an image of the HitCollector
results for any specified query (a gray image where every pixel represents 8-32 successive bits
of the bit vector; higher density => darker color). I like to see the enemy first.
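
For anyone who wants to try the same idea, here is a minimal sketch of such a density image.
It assumes the hits have already been collected into a java.util.BitSet; the class name
BitDensityImage and the parameters (numBits, bitsPerPixel, width) are made up for this example
and are not our actual tool.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;
import java.util.BitSet;

class BitDensityImage {
    /** Gray level for bits [from, to): 255 (white) if none set, 0 (black) if all set. */
    static int grayLevel(BitSet bits, int from, int to) {
        int set = bits.get(from, to).cardinality();
        return 255 - (255 * set) / (to - from);
    }

    /** Each pixel covers bitsPerPixel successive bits; pixels are laid out row by row. */
    static BufferedImage render(BitSet bits, int numBits, int bitsPerPixel, int width) {
        int pixels = (numBits + bitsPerPixel - 1) / bitsPerPixel;
        int height = Math.max(1, (pixels + width - 1) / width);
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = img.getRaster();
        for (int p = 0; p < width * height; p++) {
            int gray = 255; // padding pixels past the last bit stay white
            if (p < pixels) {
                int from = p * bitsPerPixel;
                int to = Math.min(numBits, from + bitsPerPixel);
                gray = grayLevel(bits, from, to);
            }
            raster.setSample(p % width, p / width, 0, gray);
        }
        return img;
    }
}
```

The result can be written out with javax.imageio.ImageIO.write(img, "png", file) and viewed in
any image viewer.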
While we are already in this area, just a curiosity: a friend of mine has one head-spinning
idea, to use graphics card hardware to do super fast bit vector operations. These things
today are really optimized for basic bit operations. I am just curious to see what he comes
up with.
I hope I will have some time next week or so to polish some tests for OpenBitSet a bit and
drop them somewhere on Jira, if anybody is interested in playing with them.

A bit off topic: is anybody working on a ChainedFilter version that uses docNrSkipper?
As I recall, you wrote the BitSet version :)
----- Original Message ----
From: Chris Hostetter <>
To:; eks dev <>
Sent: Tuesday, 16 May, 2006 8:13:53 PM
Subject: Re: OpenBitSet

: I also measured on different densities, and it looks about the same.
: When I find a few spare minutes I will make a PerfTest that generates
: gnuplot diagrams. It would be interesting to see how all key methods behave
: as a function of density/size.

I was thinking the same thing ... i just haven't had time to play with it.

It might also be useful to check how the distribution of the set bits
affects things -- I suspect that for some "Filters" there is some amount of
clustering, as many people index their documents in a particular order and
then filter on ranges of that order (ie: indexing documents as they are
created, and then filtering on create date) ... using
Random.nextGaussian() to pick which bits to set might be interesting.
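
A minimal sketch of what such a test-data generator could look like, using plain
java.util.BitSet. The class name ClusteredBits and the parameters (center, stdDev, seed)
are hypothetical, chosen only for illustration.

```java
import java.util.BitSet;
import java.util.Random;

class ClusteredBits {
    /**
     * Makes 'draws' attempts to set a bit near 'center', with positions spread by
     * Random.nextGaussian(). Draws outside [0, size) are dropped and repeated
     * positions collapse, so the resulting cardinality can be less than 'draws'.
     */
    static BitSet gaussianBits(int size, int draws, double center, double stdDev, long seed) {
        Random rnd = new Random(seed);
        BitSet bits = new BitSet(size);
        for (int i = 0; i < draws; i++) {
            long idx = Math.round(center + rnd.nextGaussian() * stdDev);
            if (idx >= 0 && idx < size) {
                bits.set((int) idx);
            }
        }
        return bits;
    }

    /** Uniformly scattered baseline for comparison, built with the same number of draws. */
    static BitSet uniformBits(int size, int draws, long seed) {
        Random rnd = new Random(seed);
        BitSet bits = new BitSet(size);
        for (int i = 0; i < draws; i++) {
            bits.set(rnd.nextInt(size));
        }
        return bits;
    }
}
```

Running the same benchmark on gaussianBits and uniformBits at equal size and number of draws
would show whether clustering changes the timings.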


To unsubscribe, e-mail:
For additional commands, e-mail:
