hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/IndexDev/Bitmap" by MarquisWang
Date Sat, 20 Nov 2010 02:39:18 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/IndexDev/Bitmap" page has been changed by MarquisWang.


New page:
= Bitmap Indexing =


== Introduction ==

This document explains the proposed design for adding a bitmap index handler (https://issues.apache.org/jira/browse/HIVE-1803).
Bitmap indexing (http://en.wikipedia.org/wiki/Bitmap_index) is a standard technique for indexing
columns with few distinct 
values, such as gender.

== Approach ==

We want to develop a bitmap index that can reuse as much of the existing Compact Index code
as possible. 

== Proposal ==

=== First implementation ===

This implementation confers some of the benefits of bitmap indexing and should be easy to
implement given the already existing compact index, but it does few of the optimizations such
as compression that a really good bitmap index should do.

Like the complex index, this implementation uses an index table. The index table on a column
"key" has three columns, _bucketname, _offset, and _bitmaps. _bucketname is a string pointing
to the hadoop file that is storing this block in the table, _offset is the block offset of
a block, and _bitmaps is a Map where the keys are all the values of the column "key" that
exist in this block and a bitmap encoding (an Array of BigInts??) of every row in that block,
with a 1 if that row has the value, 0 if not. If a key value does not appear in a block at
all, the value is not stored in the map.

When querying this index, we select each filename, block pair where the _bitmaps Map has a
key that is the queried key value. If there are boolean AND or OR operations done on the predicates
with bitmap indexes, we can use bitwise operations to try to eliminate blocks as well. We
can use this data to generate the filename, array of block offsets format that the compact
index handler uses and reuse that in the bitmap index query.

=== Second iteration ===

The basic implementation's only compression is eliminating blocks where all rows are 0s. This
is unlikely to happen for larger blocks, so we need a better compression format. What we can
do is do byte-aligned bitmap compression, where the bitmap is an array of bytes, and a byte
of all 1s or all 0s implies one or more bytes where every value is 0 or 1. Then, we would
just need to add another column in the bitmap index table that is an array of Ints that describe
how long the gaps are and logic to expand the compression.

View raw message