lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eks dev <>
Subject Re: OpenBitSet
Date Sun, 14 May 2006 18:15:48 GMT
  It is faster than BitSet, even against Mustang. The numbers are a bit less than on Yonik’s
HW, but quite convincing. I did small test on my XP Notebook (Pentium M 1.6GHz). Only “union”
test is some 20% slower on 8Mio size with 80k bits set. I did not dig deeper. 
  As much as it is worth, my proposal would be to hide all usages of BitSet (there is also
one BitVector around?) inside Lucene behind appropriate interfaces in order to leverage pearls
like this one. Actually it should not be too complicated, the only ugly back compatibility
issue I could identify is in Filter, but for this Yonik, Hoss and Paul already made rather
acceptable extend/deprecate plans.
  Maybe separate package for various BitSet / IntegerSet implementation would not be such
a bad idea as there is no single best implementation? Just let me remind on what we have around:
  BitSet (OpenBitSet as faster variant of the more or less the same)
  TreeBasedIntSet (I am trying to make my own implementation inspired by
  I could easily imagine to put all these together behind simple interfaces and to provide
some Factories and a few utility methods with a bit of intelligence (for example, CheekyBitSet
that collects document ids in some small int[] and if it fits, keep them there, if not switch
automatically to some compact representation)…
  Implementations above cover practically all important cases for caching/Filtering and bring
some extra speed/memory for other uses: 
  OpenBitSet (referent, “general workhorse”): 
  Fast random access, does not grow, high memory demands for sparse BitsSets, reasonably fast
Iterator over set bits
       Slower random access, very fast Iterator possible, low memory demands on very sparse
       Slow random access, extremely fast Iterator over set bits, low memory demands on sparse
bit sets, fast populating if adding elements that are sorted… 
    Almost the same as IntArraySotredIntList, but random access almost not possible, but Memory
demands very low, very useful for moderately sparse bit sets. 
     The best general performance for sparse bit sets, grows dynamically, exploits runs of
clear/set bits to reduce memory demands. Extremly fast Iterator over set  bits  for low to
medium sparse bit sets, acceptable random access. 
  Considering Zipf distribution curse in almost all cases I have ever seen (except for “pure”
filtering fields like language, categories…), having new BitSet(MAX_DOCUMENT) looks like
pure luxury (e.g. in ChainedFilter). As you all know Big majority of bit vectors usually met
in practice are very to modestly sparse due to Zipf… 
  Did I say it before, this code Yonik wrote is way above my head, I just tried to summarize
a few thoughts from “library user” perspective as it seams to me that caching and this
bit twiddling is next major performance/scalability milestone for Lucene.  

----- Original Message ----
From: Yonik Seeley <>
Sent: Friday, 12 May, 2006 10:29:24 PM
Subject: Re: OpenBitSet

Code is here for those interested:

-Yonik Solr, the open-source Lucene search server

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message