hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From java8964 <java8...@hotmail.com>
Subject RE: some data modeling questions related to Hbase
Date Fri, 14 Feb 2014 14:36:52 GMT
Hi, Ted:
I will take a look.

> Date: Thu, 13 Feb 2014 20:39:18 -0800
> Subject: Re: some data modeling questions related to Hbase
> From: yuzhihong@gmail.com
> To: user@hbase.apache.org
> bq. This system has to support Hive, as most of users prefer SQL
> Have you considered Apache Phoenix ?
> See http://incubator.apache.org/projects/phoenix.html
> Cheers
> On Thu, Feb 13, 2014 at 6:49 PM, java8964 <java8964@hotmail.com> wrote:
> > Hi,
> > I am researching a project to load data from MySQL into HBase.
> > The data is our user data. We want to analyze the user data with its
> > activities and tracking. The user data is big. The record count reaches
> > billion and whole data set size is about more than several Ts.
> > The purpose of this data in this system is pure analyzing. The reason I am
> > thinking of using HBase instead of HDFS is because of daily change of the
> > data. Every day, there could be about 1 to 5% of user having any
> > update/delete activities. Because HDFS is a write-once system, I have to
> > regenerate the whole latest snapshot just for small change of the data, it
> > is kind of wasting.
> > So I am thinking storing the data into HBase. All other
> > activities/tracking data (Fact data) will store in HDFS. I hope I can get
> > good performance joining the data between HBase and HDFS for different
> > datasets. I test the HBase scan speed, not as good as HDFS, but maybe
> > acceptable for my case. HBase here is only a sink/source of MR jobs in
> > Hadoop. So I want to give it a try.
> > This system has to support Hive, as most of users prefer SQL. When I want
> > to store the data into HBase, I am not sure what is the best way to store
> > them.
> > As sample example, if I have data from the users table, and also have the
> > data from addresses table, it is one-to-many relationship between them.
> > Let's say I have user with (userid, name, ...)and address with (addressId,
> > userId, street, state ...). I think I have following 2 options:
> > Store them as 2 tables in HBase, using userid as row key of Users table is
> > obvious. For address table, I can use either addressId as row key, or
> > (userId + addressId) as row key. The benefits of this way is that it is a
> > straight forward, and I have very easy mapping both tables into Hive
> > external table. But I have following doubts:The data between 2 tables will
> > be join again and again. If I can store them together during ETL, I can
> > avoid thatEven if I choose (userId + addressId) as row key (I can map this
> > row key into 2 columns in Hive), it won't really help me during the join,
> > right? If I query like this (select * from addresses where userId = 'xxx'),
> > the HBase still need to search across all the regions, as the row key is
> > (userid + addressed), correct? Can HBase utilize that first part of the row
> > key being queried?In this case, there is not much optimizing can use from
> > HBase. It will be pure data source of MR job, right?Another way is to store
> > them in one table. But since I don't know how many addresses a user could
> > have, even if I create an address column family in user table, how do I
> > create column name? I thought of using addressId as prefix of the column
> > name in this CF, but from what I found out so far, it is impossible to map
> > dynamic column names of HBase in Hive, especially when the count of columns
> > is unknown. I would prefer to store this way, but I have to support Hive,
> > and don't know how to archive that. Besides there are 7-8 user related
> > tables, I am not sure store all of them together is a good idea.
> > I want to know if anyone here can share some good/bad of each way. Since I
> > am still very new to HBase, I have also some additional questions:
> > 1) I know I should use short CF names and column names, so I am thinking
> > just use index number as column names, like (1, 2, 3 ...). Is that fine?2)
> > I am thinking storing frequently queried columns into one CF, the rest in
> > another. How many CFs I can create? From HBase wiki page, it said only 2 or
> > 3 CFs are enough. Is that kind of too small?
> >
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message