From Tom Brown <tombrow...@gmail.com>
Subject More tables, or add a prefix to each row key?
Date Thu, 19 Apr 2012 01:01:40 GMT

I'm writing an OLAP cube database and can implement the storage with
one of two schemas, but I don't know whether there are performance
trade-offs I'm not aware of.

Each row represents a unique cell in the cube, with about 5 columns
for each row. The row key format is a set of attributes joined by a
delimiter. To represent a cube cell that contains rolled-up data, I
use fewer attributes in the row key.

For example, if a, b, and c are attributes, a single new piece of
information with all those attributes could affect rows "a-b-c",
"a-b", and "a".
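
To make the roll-up concrete, here is a minimal sketch in Python. The
function name rollup_keys is hypothetical, and it assumes roll-ups only
ever drop trailing attributes, as in the example above:

```python
def rollup_keys(attributes, delimiter="-"):
    """Return the most specific row key plus each rolled-up key.

    Assumption: a rolled-up cell is formed by dropping trailing
    attributes, so the keys are the non-empty prefixes of the list.
    """
    return [
        delimiter.join(attributes[:n])
        for n in range(len(attributes), 0, -1)
    ]


# One new piece of information touches every key in this list.
affected = rollup_keys(["a", "b", "c"])
print(affected)
```

Each incoming fact would then update one row per key returned.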

When I perform a scan, I append "$" and "~" onto the desired start and
end keys (respectively) so I can be sure to get all the data in
between. When I want to get the most specific data, I can scan from
"a-b-c" to "a-b-c" and retrieve a single row. But a problem occurs
when rows with different types of attributes are interleaved when put
in order.
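
A rough sketch of how the "$" and "~" suffixes bound a scan, assuming
the delimiter is "-" and keys are compared in ASCII byte order ("$"
sorts before "-", and "~" sorts after letters and digits). The helper
name scan_bounds and the sample keys are made up for illustration:

```python
def scan_bounds(start_key, end_key):
    # "$" (0x24) sorts before "-" (0x2D); "~" (0x7E) sorts after
    # ASCII letters and digits.
    return start_key + "$", end_key + "~"


# Hypothetical rows, kept in sorted order as HBase would store them.
rows = sorted(["a", "a-b", "a-b-c", "a-b-d", "a-c"])

start, stop = scan_bounds("a-b", "a-b")
hits = [key for key in rows if start <= key <= stop]
print(hits)
```

Under these assumptions the range covers the keys that extend "a-b"
with further attributes; the bare key "a-b" itself sorts before "a-b$".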

For example, if "a1" and "a2" are possible values for attribute "a",
the following sets of rows are in a bad order for efficient scans:


When I want to retrieve data for all rows of "a-b-c", I have to scan
from "a1-b-c" until after "a2-b-c", and skip the irrelevant rows.
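
The interleaving can be reproduced with a small simulation. This is
only a sketch: the row keys are hypothetical, and a sorted Python list
stands in for HBase's byte-ordered row storage:

```python
from bisect import bisect_left, bisect_right

# Hypothetical keys for two values of attribute "a" at three roll-up levels.
rows = sorted(["a1", "a1-b", "a1-b-c", "a2", "a2-b", "a2-b-c"])


def scan(sorted_rows, start, stop):
    """Return keys with start <= key <= stop, like an inclusive range scan."""
    lo = bisect_left(sorted_rows, start)
    hi = bisect_right(sorted_rows, stop)
    return sorted_rows[lo:hi]


scanned = scan(rows, "a1-b-c", "a2-b-c")
# Rows with all three attributes have exactly two delimiters.
wanted = [key for key in scanned if key.count("-") == 2]
skipped = [key for key in scanned if key.count("-") != 2]
print(scanned, wanted, skipped)
```

Here half of the scanned rows belong to coarser roll-up levels and have
to be discarded by the client or by a filter.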

The two possible solutions are to either put each data style in its
own table (e.g. table "a-b-c" would contain "a1-b-c" and "a2-b-c",
table "a-b" would contain "a1-b" and "a2-b"), or to prefix each row
key with data that forces it into a useful order:
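
As a sketch of the prefix approach, assume a hypothetical prefix equal
to the number of attributes in the key (any prefix that groups rows by
style would do, and a single digit only sorts correctly for fewer than
ten attributes):

```python
def prefixed_key(attributes, delimiter="-"):
    # Hypothetical scheme: lead with the attribute count so that rows
    # of the same style sort next to each other.
    return f"{len(attributes)}{delimiter}{delimiter.join(attributes)}"


plain = ["a1", "a1-b", "a1-b-c", "a2", "a2-b", "a2-b-c"]
keys = sorted(prefixed_key(key.split("-")) for key in plain)
for key in keys:
    print(key)
```

With this ordering, all "a-b-c" style rows form one contiguous range
that a single scan can cover without skipping anything.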


Placing each style of row into its own table seems to be the ideal
solution, but I don't understand the internal HBase architecture well
enough to know for sure. Will the overhead of having extra tables (if
there is any) outweigh the overhead of adding the prefix to each row
key? Is there a better way to structure my solution?

Thanks in advance!


