hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Yu <yuzhih...@gmail.com>
Subject Re: some data modeling questions related to Hbase
Date Fri, 14 Feb 2014 04:39:18 GMT
bq. This system has to support Hive, as most of users prefer SQL

Have you considered Apache Phoenix ?
See http://incubator.apache.org/projects/phoenix.html

Cheers


On Thu, Feb 13, 2014 at 6:49 PM, java8964 <java8964@hotmail.com> wrote:

> Hi,
> I am researching a project to load data from MySQL into HBase.
> The data is our user data. We want to analyze the user data with its
> activities and tracking. The user data is big. The record count reaches
> billion and whole data set size is about more than several Ts.
> The purpose of this data in this system is pure analyzing. The reason I am
> thinking of using HBase instead of HDFS is because of daily change of the
> data. Every day, there could be about 1 to 5% of user having any
> update/delete activities. Because HDFS is a write-once system, I have to
> regenerate the whole latest snapshot just for small change of the data, it
> is kind of wasting.
> So I am thinking storing the data into HBase. All other
> activities/tracking data (Fact data) will store in HDFS. I hope I can get
> good performance joining the data between HBase and HDFS for different
> datasets. I test the HBase scan speed, not as good as HDFS, but maybe
> acceptable for my case. HBase here is only a sink/source of MR jobs in
> Hadoop. So I want to give it a try.
> This system has to support Hive, as most of users prefer SQL. When I want
> to store the data into HBase, I am not sure what is the best way to store
> them.
> As sample example, if I have data from the users table, and also have the
> data from addresses table, it is one-to-many relationship between them.
> Let's say I have user with (userid, name, ...)and address with (addressId,
> userId, street, state ...). I think I have following 2 options:
> Store them as 2 tables in HBase, using userid as row key of Users table is
> obvious. For address table, I can use either addressId as row key, or
> (userId + addressId) as row key. The benefits of this way is that it is a
> straight forward, and I have very easy mapping both tables into Hive
> external table. But I have following doubts:The data between 2 tables will
> be join again and again. If I can store them together during ETL, I can
> avoid thatEven if I choose (userId + addressId) as row key (I can map this
> row key into 2 columns in Hive), it won't really help me during the join,
> right? If I query like this (select * from addresses where userId = 'xxx'),
> the HBase still need to search across all the regions, as the row key is
> (userid + addressed), correct? Can HBase utilize that first part of the row
> key being queried?In this case, there is not much optimizing can use from
> HBase. It will be pure data source of MR job, right?Another way is to store
> them in one table. But since I don't know how many addresses a user could
> have, even if I create an address column family in user table, how do I
> create column name? I thought of using addressId as prefix of the column
> name in this CF, but from what I found out so far, it is impossible to map
> dynamic column names of HBase in Hive, especially when the count of columns
> is unknown. I would prefer to store this way, but I have to support Hive,
> and don't know how to archive that. Besides there are 7-8 user related
> tables, I am not sure store all of them together is a good idea.
> I want to know if anyone here can share some good/bad of each way. Since I
> am still very new to HBase, I have also some additional questions:
> 1) I know I should use short CF names and column names, so I am thinking
> just use index number as column names, like (1, 2, 3 ...). Is that fine?2)
> I am thinking storing frequently queried columns into one CF, the rest in
> another. How many CFs I can create? From HBase wiki page, it said only 2 or
> 3 CFs are enough. Is that kind of too small?
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message