incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Stump <mrevilgn...@gmail.com>
Subject Re: Bitmap indexes - reviving CASSANDRA-1472
Date Fri, 12 Apr 2013 17:41:22 GMT
It looks like there is some interest so I'm going to disgorge everything
I've learned/considered in the past couple weeks just so that we have a
consistant base. I'm going to break down how the indexes work, different
optimizations and drawbacks and try to address the points/questions that
people have raised. I've broken it down by subject.

Basic Equality:
Bitmap indexes are essentially giant arrays of bytes, with one bit per
possibility. If you want to know what rows have boolean value "event1" set
to true then you set the address of those rows to 1 in the index. For
example in index 0100100101, this would mean rows 1, 4, 7 and 9 would
contain "event1". If you want to know which rows contain event1 and event2
then you do a bitwise AND over the two indexes to get the set intersection
eg. 00100101 AND 10000101 results in 00000101. From this you can build
complex queries by doing simple set arithmetic. Each of these sets are
called a query dimension.

Range Queries:
If you want to encode ranges such as give me all users who have a counter
in the integer interval [0, 2] then you need a two dimensional bitmap
index. The first dimension is what values between [0, 7] have been hit:
10010011, the second dimension is which rows for each of those possible
values contain the value. So for value 0 there would be another index
00100010, which means that rows 3 and 7 contain value 0.  This forms a
giant two dimensional array.

[1] [00100010]
[0] [00001010]
[0] [10100110]
[1] [00100010]
[0] [00101010]
[0] [01101010]
[0] [00111110]
[1] [00100110]
[1] [00100000]

To figure out the answer to who has counter value [0, 2] would be the union
of the sets  [00100010], [00001010], [10100110] which is [10101110].

Binning:
Each index has an address size which limits the total number of rows/values
you can encode in an index. So if you have a 32bit index out can encode a
total of 4,294,967,295 positions. If you need to encode more values than
what is possible in that address space you can do two things. Increase the
address space or perform binning. Binning is essentially hash collision,
meaning two or more values are assigned to the same value. For example if
you wanted to index floats you could use the same index as above but if you
want to know the rows who contain the real numbers [0,1.9] then you would
need to check the actual value for the rows in the result set. This is
called a candidate check which is very expensive, often the most expensive
part of a query.

Segments:
If you increase the address size of the index then you run into
space/memory problems. A 32 bit index is 4 GB when fully materialized, a 64
bit index is 2 PB. To solve this problem you use segments/partitions or
compression. Segments work through the use of a sparse table
or interval list. You break the index up into chunks of equal size lets say
1kb. If an bit gets flipped at position 4098, then you go to segment 0x02
and if it exists you flip that bit. If that segment doesn't exist then you
create it and set the bit. The advantage is that you only have to create
segments that contain flipped bits. The downside is that if you have a wide
distribution of bits flipped you end up with many segments with one or two
bits flipped and the is empty space or wasted memory.

Compression:
The other approach to use compression. The trick is that you need a
compression algorithm that doesn't require decompression before doing the
bitwise operations. This means you need a run length encoding (RLE). There
are 3 leading contenders for king of the hill, CONCISE which is used by
Druid, WAH which is used by Fastbit, and PWAH which isn't used by anybody I
think yet. They all work the same way which is to encode store large blocks
of zeros or ones as encoded values.  So you get this index taken from PWAH
[101 10000000001011 111010000100101 000000000010010] which means 11 blocks
of 1 bits a literal block of 15 bits and 18 blocks of 0 bits. The downside
to this approach is that you lose the ability to index directly to a
position in the index. If you want to perform an update you've either got
to read/decode to that position and split the index or rebuild the entire
index. This is why Druid, and Fastbit are very good for static data sets
but can't deal with fast changing data.

A hybrid approach to performance:
You can take a hybrid approach to performance which means combine segments
and compression. You break the index address space up into segments and
then perform compression within each of those segments to negate the
problem of many segments with only 1 bit flipped but consuming the entire
segment size worth of memory. This also limits the cost of having to split
the encoding for any compressed segment. As segments fill up you can
compress and join them together using the standard SSTable or levelDB
approach.

Distributing work across the ring:
If you use segments and a uniform index size that means you can assign
segments of the index to different portions of the C* ring. This means that
one node would have all the segments necessary to answer queries involving
that portion of the address space. If this also corresponds to the rows it
has, It means it could answer with certainty that rows in it's address
space match the query, and to get the entire query results you just merge
the segments to obtain the final set.

Other optimizations possibly worth considering:
If you know the size of each index beforehand you can perform the bitwise
operations across indexes with the indexes sorted by increasing size.  If
you know that the index for event1 only has 1 segment populated that means
you only need to look at that corresponding segment for the other indexes
even if those indexes are a fully materialized. There are also some other
bitmap index types that more efficiently encode range but I don't know
enough about them yet to render an opinion.

Other optimizations not worth considering:
There are some other papers on limiting the IO cost of candidate checks
through optimal binning strategies, but this requires modeling or having
full knowledge of the distribution of your data which isn't feasible.

Query planning/limitations:
To be honest I haven't dug into the existing query planner for CQL yet so I
can't answer to Jonathan's point about it's complexity. What I can say is
that the query logic for bitmap indexes is pretty simple. You just need to
break it down into a set of boolean operations over the selected sets, and
it can be performed in parallel by performing the same operations
over adjacent segments. If you have knowledge about about segment size then
you can perform the operations in order to limit the operation size. To
simply things I would not allow indexing of CQL3 collections. My temptation
would be to ship with 2D indexes allowing support for range queries (also I
need it for my use case), but maybe that comes in another iteration.
Trigram support, fuzzy matching and regex just get layered on top of these
fundamental concepts, and you can steal the hard bits from existing
libraries/projects.

Further research:
Some thought would need to be put into what are the appropriate segment
size, index size, binning strategy, and hash selection for encoding
non-numeric values.

That's basically it. I'm sure I've forgotten something. I'm really looking
forward to any feedback, questions, or criticisms that you may have. I
would really like to see something like this in C*, and I'm offering up my
dedicated time to make it happen if I can make the goals/timeline of
my employer and the project meet up. I've already done things such as put
the book I was writing on hold in order to make time. I understand if you
say no, you do have the greater project to consider. If you do say no then
we will most likely continue with this project as a separate query/index
engine and C* will just act as the storage layer. If we go alone the
resulting code may or may not be open source; it hasn't been decided yet.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message