hbase-user mailing list archives

From "Sharma, Avani" <agsha...@ebay.com>
Subject RE: Hbase schema design question for time based data
Date Wed, 16 Jun 2010 21:26:59 GMT
Thanks Jonathan. 

Could you point me to the trunk to look at an example mapper that extends TableMapper? I am
unable to assign this internal mapper to an org.apache.hadoop.mapreduce.Job in Hadoop.

I am not interested in HFileOutputFormat; I just want the same code as in Allen Day's blog,
written using the new API, for reference.


-----Original Message-----
From: Jonathan Gray [mailto:jgray@facebook.com] 
Sent: Wednesday, June 16, 2010 11:40 AM
To: user@hbase.apache.org
Subject: RE: Hbase schema design question for time based data

> Hi,
> 
> I am trying design schema for some data to be moved from HDFS into
> HBase for real-time access.
> Questions -
> 
> 1. Is the use of the new API for bulk upload recommended over the old API? If
> yes, is the new API stable, and is there sample executable code around?

Not sure if there is much sample code in the branch, but Todd Lipcon has done some great work
in trunk that includes some example code, I believe.

There's going to be a short presentation on HFileOutputFormat and bulk loading at the HUG
on June 30th if you're interested in attending (http://meetup.com/hbaseusergroup).

In general it can make lots of sense for particular use cases, so sometimes it is recommended
and sometimes not.  It depends on the requirements.
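For reference, a minimal mapper against the new org.apache.hadoop.mapreduce API might look
something like the sketch below. This is only an illustration, not code from trunk: the table
name, family, and class names are made up, and TableMapReduceUtil.initTableMapperJob is what
wires the mapper into the Job (trying job.setMapperClass alone is usually where the "unable to
assign" problem comes from).

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class ExampleTableJob {

  // Hypothetical mapper: receives one row per map() call and simply re-emits it.
  static class RowMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
        throws IOException, InterruptedException {
      // Real processing of the row's cells would go here.
      context.write(rowKey, columns);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "example-table-job");
    job.setJarByClass(ExampleTableJob.class);

    // Full-table scan; in practice restrict it with addFamily()/setTimeRange().
    Scan scan = new Scan();

    // This helper configures the input format AND sets the mapper on the Job.
    TableMapReduceUtil.initTableMapperJob("mytable", scan, RowMapper.class,
        ImmutableBytesWritable.class, Result.class, job);

    job.setNumReduceTasks(0); // map-only job for this sketch
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```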


> 2. The data is time-based. I need to be able to retrieve the latest
> records before a particular date; note that I do not know what exact
> timestamp that would be. I could need a user's profile data from a
> month or a year earlier. How can this be achieved using HBase in terms
> of schema?
> 
>                 a. If the column values are small in size, can I use
> versioning for up to 100 values?

Versioning can be used for thousands or possibly millions of versions of a single column.
There are some performance TODOs, which I am working on, related to making TimeRange queries
more efficient; they are in the pipeline for the next couple of months.

If you're generally reading the more recent versions, then performance should be acceptable.
Reading back into some of the older ones will work, but is currently not nearly as efficient
as it could be.


>                 b. Should I maintain a secondary index for each date and
> the latest date/timestamp when profile data is generated/applicable to
> that date? Then use this information to come up with the user-and-timestamp
> key in the main table, which would have user_ts as the row key and the
> data in the columns?

Not sure exactly what you mean here, but it doesn't seem you would really need a secondary
index to do what you want.  When using versioning you can always ask for "give me the 10
latest versions" or "give me the 100 latest versions that occur after date X".
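As a rough sketch of that second query, a client-side Get can combine a version limit with a
time range. Everything here (table name "profiles", family "p", row key "user123", the date
value) is hypothetical; adjust to your schema.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class LatestVersions {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "profiles");

    Get get = new Get(Bytes.toBytes("user123"));
    get.addColumn(Bytes.toBytes("p"), Bytes.toBytes("data"));
    get.setMaxVersions(100);                  // "the 100 latest versions..."
    long dateX = 1262304000000L;              // e.g. 2010-01-01, ms since epoch
    get.setTimeRange(dateX, Long.MAX_VALUE);  // "...that occur after date X"

    // Versions of a cell come back ordered newest-first.
    Result result = table.get(get);
    System.out.println("cells returned: " + result.size());
    table.close();
  }
}
```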

> 
>                 c. for the columns, how do I decide between using
> multiple columns within a column family or multiple column families?

This depends on the read/write patterns.  Do the different families have different access
patterns?  Do you often read from just one family and not the others, or write to just one
family and not the others?  That would be a good reason to split the data into multiple
families.  If the data all has a similar access pattern, then you should probably put it in a
single family.  Each family is basically like its own table; each is stored separately on disk.
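To make that concrete, here is a hedged sketch of creating a table with two families, one
frequently read ("p" for profile data) and one rarely read ("h" for history). The split, the
names, and the version count are assumptions for illustration, not a recommendation for your
schema.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateProfiles {
  public static void main(String[] args) throws Exception {
    HTableDescriptor desc = new HTableDescriptor("profiles");

    // Frequently-read profile data; keep up to 100 versions per cell
    // (ties back to question 2a about versioning for ~100 values).
    HColumnDescriptor profile = new HColumnDescriptor("p");
    profile.setMaxVersions(100);
    desc.addFamily(profile);

    // Rarely-read history, split out only because its access pattern differs;
    // each family is flushed and stored as its own set of files on disk.
    desc.addFamily(new HColumnDescriptor("h"));

    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    admin.createTable(desc);
  }
}
```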

I think an in-person discussion would help a lot.  Since you are local (I am guessing), see if
you can come by the Hackathon or the HUG in two weeks and we can talk more about it.  We can
then post back to the list once we figure out a decent solution for your use case.

JG 

