hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: HBase schema question
Date Sat, 21 Jan 2012 12:54:28 GMT
You don't do joins.

Sorry, but you need to put this in perspective... 

You need to get really drunk, and with the next morning's hangover you need to look at HBase
as HBase and not think in terms of a relational schema.

Having said that, you can do joins; however, they are tricky to do right and very expensive.
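
To make the cost concrete, here is a rough sketch of what a client-side "join" amounts to: one range scan per entry in the subset table. It is plain Python standing in for a sorted key-value table, not the HBase client API. The `row_key` layout, the calendar dates, and the `x1` column name are assumptions made for illustration only.

```python
from bisect import bisect_left, bisect_right

def row_key(user_id: int, day: str) -> str:
    # Zero-padding keeps lexicographic key order equal to numeric user order.
    return f"{user_id:010d}|{day}"

# Stand-in for a table whose rows are kept sorted by row key (illustrative data).
rows = sorted(
    [
        (row_key(1, "2012-01-01"), {"x1": 4}),
        (row_key(1, "2012-01-02"), {"x1": 2}),
        (row_key(1, "2012-01-03"), {"x1": 30}),
        (row_key(2, "2012-01-01"), {"x1": 23}),
        (row_key(2, "2012-01-02"), {"x1": 56}),
    ],
    key=lambda row: row[0],
)
keys = [key for key, _ in rows]

def scan(user_id: int, start_day: str, end_day: str):
    """Return the rows of one user with start_day <= day <= end_day."""
    lo = bisect_left(keys, row_key(user_id, start_day))
    hi = bisect_right(keys, row_key(user_id, end_day))
    return rows[lo:hi]

# The "join": every (user, start, end) entry of the subset costs one range scan.
subset = {1: ("2012-01-02", "2012-01-03"), 2: ("2012-01-01", "2012-01-01")}
totals = {
    user: sum(cells["x1"] for _, cells in scan(user, sd, ed))
    for user, (sd, ed) in subset.items()
}
```

Each scan is cheap on its own because a user's days sit next to each other in key order. The expense comes from repeating it for millions of subset entries, which is why this kind of work is usually pushed into a batch job rather than issued from a single client.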


Think of a hierarchical model....
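
One possible reading of "hierarchical" for the data in this thread is user -> day -> data point, with the statistics computed per user over that user's own date range. The sketch below uses a nested Python dict as a stand-in. The layout, the dates, and the `stats` helper are hypothetical and are not an HBase schema or API.

```python
import math

# Hypothetical hierarchy: user id -> day -> data point name -> value.
data = {
    1: {
        "2012-01-01": {"x1": 4, "x2": 0.8},
        "2012-01-02": {"x1": 2, "x2": 0.5},
        "2012-01-03": {"x1": 30, "x2": 0.9},
    },
}

def stats(user_id: int, metric: str, start_day: str, end_day: str) -> dict:
    """Aggregate one data point for one user over an inclusive date range."""
    values = [
        cells[metric]
        for day, cells in data[user_id].items()
        if start_day <= day <= end_day  # ISO dates compare correctly as strings
    ]
    if not values:
        raise ValueError("no data points in the requested range")
    mean = sum(values) / len(values)
    # Population variance; use len(values) - 1 instead for the sample variance.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "stddev": math.sqrt(variance),
    }

result = stats(1, "x1", "2012-01-01", "2012-01-03")
```

Everything needed for one user's statistics lives under that user, so no second table has to be consulted to answer the query.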

Note: if you don't drink, you can always exercise until you put your body into a zone,
then you can start to think clearly.

I would say go meditate, but usually one just falls asleep, and that is counterproductive when
you live and work in a society where your manager tells you that sleeping is highly overrated...
:-)

Sent from my iPhone

On Jan 21, 2012, at 1:33 AM, "Amit Gupta" <dlgamit16@gmail.com> wrote:

> I am not sure how I can do joins using HBase which is essentially what I am
> trying to do. Based on what I have read it looks
> like HBase is really good for scans or row key lookup. Please correct me if
> I am wrong.
> 
> I can have an HBase table for users with {userid + timestamp} as the rowkey.
> Using this, a lookup for a single user for a given time
> range will be fast. However, I need to do lookups for millions of users for
> different time ranges. Will that also be fast?
> 
> Also, lookups are not the only thing that I am trying to do. I need to
> compute statistics like sum, min, max etc. for each data
> point for a user. How can I do that efficiently using HBase?
> 
> 
> On Fri, Jan 20, 2012 at 2:20 PM, T Vinod Gupta <tvinod@readypulse.com> wrote:
> 
>> from the little i have used hbase for, it is really good for the below use
>> case you mentioned. hbase takes care of scale and you can use map reduce to
>> do the kind of task you mentioned below.
>> but please remember that it is super important how you design the schema.
>> the schema should allow for your use case and allow for an efficient map
>> reduce.
>> if you decide with hbase, read the hbase book before deployment or schema
>> design/implementation.
>> thanks
>> 
>> On Fri, Jan 20, 2012 at 2:10 PM, Amit Gupta <dlgamit16@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> I am trying to figure out if Hbase is the right candidate for my use case
>>> which is as follows :
>>> 
>>> 
>>> 
>>> I have a users table containing millions of users, and for each user I have a
>>> bunch of data points for each day in the past 2 years. Some of these data
>>> points are number of clicks in different parts of a web page, total # of
>>> clicks, total searches, # of unique searches etc. So the data is in this form:
>>> 
>>> 
>>> 
>>> User Id | Date   | X1 (Total Clicks) | X2 (Total Searches) | X3 | ….. | Xn
>>> 1       | D1-730 | 4                 | 0.8                 |    |     | 90
>>> 1       | D1-729 | 2                 | 0.5                 |    |     | 50
>>> …       |        |                   |                     |    |     |
>>> 1       | D1     | 30                | 0.9                 |    |     | 20
>>> 2       | D1-730 | 23                | 1.2                 |    |     | 85
>>> 2       | D1-729 | 56                | 2.3                 |    |     | 56
>>> ….      |        |                   |                     |    |     |
>>> 
>>> My application has the following predominant query pattern: for a subset
>>> of users (the subset being quite large, on the order of 1-5 million), I want
>>> to compute the sum, min, max, mean, and standard deviation of data points
>>> over different date ranges for the users. For example, user1 may have a start
>>> and end date of {sd1, ed1}, user2 may have {sd2, ed2}, and so on. I want to
>>> compute sum, min, max etc. for data points X1, X2, … Xn over date ranges
>>> {sd1, ed1}, {sd2, ed2}, {sd3, ed3} for each user in the subset.
>>> 
>>> 
>>> 
>>> Currently we do this in a database by creating a table for the subset of users
>>> with their start and end days and joining it against the users table. The
>>> query, however, is extremely slow and takes hours to execute.
>>> 
>>> 
>>> 
>>> I am trying to figure out the following :
>>> 
>>>  1. Can I do the above query efficiently (I want to reduce the query
>>>  time. Space is not that big of a concern for me) using HBase?
>>> 
>>> 
>>>  2. Can someone please give me alternative solutions if HBase is not the
>>>  right solution for such a use case?
>>> 
>>> 
>>> 
>>> Thanks,
>>> 
>>> dlg
>>> 
>> 