hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Asaf Mesika <asaf.mes...@gmail.com>
Subject Re: How to design a data warehouse in HBase?
Date Sat, 15 Dec 2012 02:14:36 GMT
Here's my take on this matter:

In the current situation, there isn't any good solution to the data warehousing solution you
want in large scale. Impala and Drill are both projects that heads in this direction, but
they still have a way to go and are not production ready yet. If you can stay at MySQL for
moment, than stay there, or go for Hive but prepare a very large cluster of computers to handle
the load.

A normal data warehouse as you describe is composed of DIMS (dimensions) and FACT tables.
Representing this as is in HBase is a mess, since this will require you to do joins across
the clusters - i.e. RPC calls and lots of them between Region Servers - which will slow down
your queries to a halt (unless you want your user to wait 10-15 minutes).

The more sane approach then is do normalize the data - i.e. have a table containing the attributes
of all dimensions in the FACT table, as one big fat FACT table - and save it to HDFS or HBase.
Both have a partition key - your primary key to query upon (e.g. timestamp-customerId, timestamp-deviceId).
You can query the data, after you filter it by the partition key, thus scanning only a portion
of it, and then on each datanode/RS, filtering by the dimensions attributes as required by
your query. If your data is distributed evenly across your cluster, running this query on
multiple nodes at the same time can overcome the downside of fully reading the files/rows
belonging to the partition key. You can add the statistical functions you require, such as
sum,count, and send the rolled up results thus saving bandwidth.

The problem in current software stacks is that there's none that actually does what is stated
above. Impala is in the right direction, but its yet to be in production state, from what
I've read. Drill is just starting. Thus you end having to write map reduce jobs, which does
the described above solution by either employing HIVE to get the HDFS files stored by partition
key and translating you query into MR job, or using other open source solutions such as Cascading
to ease the burden of writing your own MR Job code.

So in summary, I would stay at Oracle/MySQL until a descent open source answering your need
will arrive - which I guess will happen during 2013/2014. If you can't - you will be forced
to write your own custom solution, tailored to your queries, based on MR job. You can take
a look at Trecul (https://github.com/akamai-tech/trecul) to boost processing speed of your
Map Reduce job.


On 13 בדצמ 2012, at 07:57, bigdata <bigdatabase@outlook.com> wrote:

> Dear all,
> We have a traditional star-model data warehouse in RDBMS, now we want to transfer it
to HBase. After study HBase, I learn that HBase is normally can be query by rowkey.
> 1.full rowkey (fastest)2.rowkey filter (fast)3.column family/qualifier filter (slow)
> How can I design the HBase tables to implement the warehouse functions, like:1.Query
by DimensionA2.Query by DimensionA and DimensionB3.Sum, count, distinct ...
> From my opinion, I should create several HBase tables with all combinations of different
dimensions as the rowkey. This solution will lead to huge data duplication. Is there any good
suggestions to solve it?
> Thanks a lot!

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message