incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Stump <>
Subject Re: Bitmap indexes - reviving CASSANDRA-1472
Date Tue, 16 Apr 2013 04:08:57 GMT
I spent some time this afternoon thinking about ways forward. I need to
make progress regardless of whether or not my eventual work makes it into
C*. In order to do so, I was thinking about creating an index management
library and query engine in C++. Because of the nature of bitmap indexes
it's ok if the engine on any node spits out partial results with a list of
indexes it needs in order to complete the query. It would be up to the
caller to get information to the index library, and shuttle around partial
bitmaps if necessary.

I like this plan for a couple reasons. I get tight the tight control over
memory and the performance of C++. I can wrap a C++ library and make it
accessible to C* via JNA. If the work doesn't make it into C* I can just
wrap it in riak_core and still meet my goals. A library fills an empty
ecosystem niche between full blown databases and collections.

Unknowns areas of risk in my mind are that I don't know enough about the
implementation of indexes and the query planner in C* to know whether or
not this is crazy. I've read through the old patches for the aborted bitmap
index attempt and it looks pretty simple, but I lack context which is hard
to gather from old disembodied code. Are there any docs for the query
planner or indexes other than code and the bug DB?

I spent some time benchmarking over the weekend and my prototype is capable
of answering 3 clause queries for fully materialsed 4GB uncompressed
indexes (4B encoded values) in ~240 ms on my laptop. I'm fairly certain I'm
running into memory bandwidth limitations. There might be one or two small
tweaks still available to me, but they're all OS and hardware dependent.

I would love any feedback, even if it's to say "Matt, you're crazy go home".

Matt Stump

On Fri, Apr 12, 2013 at 10:41 AM, Matt Stump <> wrote:

> It looks like there is some interest so I'm going to disgorge everything
> I've learned/considered in the past couple weeks just so that we have a
> consistant base. I'm going to break down how the indexes work, different
> optimizations and drawbacks and try to address the points/questions that
> people have raised. I've broken it down by subject.
> Basic Equality:
> Bitmap indexes are essentially giant arrays of bytes, with one bit per
> possibility. If you want to know what rows have boolean value "event1" set
> to true then you set the address of those rows to 1 in the index. For
> example in index 0100100101, this would mean rows 1, 4, 7 and 9 would
> contain "event1". If you want to know which rows contain event1 and event2
> then you do a bitwise AND over the two indexes to get the set intersection
> eg. 00100101 AND 10000101 results in 00000101. From this you can build
> complex queries by doing simple set arithmetic. Each of these sets are
> called a query dimension.
> Range Queries:
> If you want to encode ranges such as give me all users who have a counter
> in the integer interval [0, 2] then you need a two dimensional bitmap
> index. The first dimension is what values between [0, 7] have been hit:
> 10010011, the second dimension is which rows for each of those possible
> values contain the value. So for value 0 there would be another index
> 00100010, which means that rows 3 and 7 contain value 0.  This forms a
> giant two dimensional array.
> [1] [00100010]
> [0] [00001010]
> [0] [10100110]
> [1] [00100010]
> [0] [00101010]
> [0] [01101010]
> [0] [00111110]
> [1] [00100110]
> [1] [00100000]
> To figure out the answer to who has counter value [0, 2] would be the
> union of the sets  [00100010], [00001010], [10100110] which is [10101110].
> Binning:
> Each index has an address size which limits the total number of
> rows/values you can encode in an index. So if you have a 32bit index out
> can encode a total of 4,294,967,295 positions. If you need to encode more
> values than what is possible in that address space you can do two things.
> Increase the address space or perform binning. Binning is essentially
> hash collision, meaning two or more values are assigned to the same value.
> For example if you wanted to index floats you could use the same index as
> above but if you want to know the rows who contain the real numbers [0,1.9]
> then you would need to check the actual value for the rows in the result
> set. This is called a candidate check which is very expensive, often the
> most expensive part of a query.
> Segments:
> If you increase the address size of the index then you run into
> space/memory problems. A 32 bit index is 4 GB when fully materialized, a 64
> bit index is 2 PB. To solve this problem you use segments/partitions or
> compression. Segments work through the use of a sparse table
> or interval list. You break the index up into chunks of equal size lets say
> 1kb. If an bit gets flipped at position 4098, then you go to segment 0x02
> and if it exists you flip that bit. If that segment doesn't exist then you
> create it and set the bit. The advantage is that you only have to create
> segments that contain flipped bits. The downside is that if you have a wide
> distribution of bits flipped you end up with many segments with one or two
> bits flipped and the is empty space or wasted memory.
> Compression:
> The other approach to use compression. The trick is that you need a
> compression algorithm that doesn't require decompression before doing the
> bitwise operations. This means you need a run length encoding (RLE). There
> are 3 leading contenders for king of the hill, CONCISE which is used by
> Druid, WAH which is used by Fastbit, and PWAH which isn't used by anybody I
> think yet. They all work the same way which is to encode store large blocks
> of zeros or ones as encoded values.  So you get this index taken from PWAH
> [101 10000000001011 111010000100101 000000000010010] which means 11
> blocks of 1 bits a literal block of 15 bits and 18 blocks of 0 bits. The
> downside to this approach is that you lose the ability to index directly to
> a position in the index. If you want to perform an update you've either got
> to read/decode to that position and split the index or rebuild the entire
> index. This is why Druid, and Fastbit are very good for static data sets
> but can't deal with fast changing data.
> A hybrid approach to performance:
> You can take a hybrid approach to performance which means combine segments
> and compression. You break the index address space up into segments and
> then perform compression within each of those segments to negate the
> problem of many segments with only 1 bit flipped but consuming the entire
> segment size worth of memory. This also limits the cost of having to split
> the encoding for any compressed segment. As segments fill up you can
> compress and join them together using the standard SSTable or levelDB
> approach.
> Distributing work across the ring:
> If you use segments and a uniform index size that means you can assign
> segments of the index to different portions of the C* ring. This means that
> one node would have all the segments necessary to answer queries involving
> that portion of the address space. If this also corresponds to the rows it
> has, It means it could answer with certainty that rows in it's address
> space match the query, and to get the entire query results you just merge
> the segments to obtain the final set.
> Other optimizations possibly worth considering:
> If you know the size of each index beforehand you can perform the bitwise
> operations across indexes with the indexes sorted by increasing size.  If
> you know that the index for event1 only has 1 segment populated that means
> you only need to look at that corresponding segment for the other indexes
> even if those indexes are a fully materialized. There are also some other
> bitmap index types that more efficiently encode range but I don't know
> enough about them yet to render an opinion.
> Other optimizations not worth considering:
> There are some other papers on limiting the IO cost of candidate checks
> through optimal binning strategies, but this requires modeling or having
> full knowledge of the distribution of your data which isn't feasible.
> Query planning/limitations:
> To be honest I haven't dug into the existing query planner for CQL yet so
> I can't answer to Jonathan's point about it's complexity. What I can say is
> that the query logic for bitmap indexes is pretty simple. You just need to
> break it down into a set of boolean operations over the selected sets, and
> it can be performed in parallel by performing the same operations
> over adjacent segments. If you have knowledge about about segment size then
> you can perform the operations in order to limit the operation size. To
> simply things I would not allow indexing of CQL3 collections. My temptation
> would be to ship with 2D indexes allowing support for range queries (also I
> need it for my use case), but maybe that comes in another iteration.
> Trigram support, fuzzy matching and regex just get layered on top of these
> fundamental concepts, and you can steal the hard bits from existing
> libraries/projects.
> Further research:
> Some thought would need to be put into what are the appropriate segment
> size, index size, binning strategy, and hash selection for encoding
> non-numeric values.
> That's basically it. I'm sure I've forgotten something. I'm really looking
> forward to any feedback, questions, or criticisms that you may have. I
> would really like to see something like this in C*, and I'm offering up my
> dedicated time to make it happen if I can make the goals/timeline of
> my employer and the project meet up. I've already done things such as put
> the book I was writing on hold in order to make time. I understand if you
> say no, you do have the greater project to consider. If you do say no then
> we will most likely continue with this project as a separate query/index
> engine and C* will just act as the storage layer. If we go alone the
> resulting code may or may not be open source; it hasn't been decided yet.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message