hbase-user mailing list archives

From "Invisible.Trust" <invisible.tr...@gmail.com>
Subject Re: HBase schema question
Date Sat, 21 Jan 2012 07:48:03 GMT
I think you need to design your schema with one table for each index
you want.
For example: tbl1 {user_id_timestamp}
tbl2 {md5(email)} [user_id_timestamp]
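
A rough sketch of how those two keys could be built, in Python. It assumes that {…} above is the row key and […] is the stored value, that the "timestamp" is a day written as YYYYMMDD, and that the user id is zero-padded to a fixed width. The function names and the two dicts standing in for the tables are made up for illustration; none of this is HBase client code.

```python
import hashlib

USER_ID_WIDTH = 10  # assumed fixed width, so keys sort by (user, day)


def user_row_key(user_id: int, day: str) -> bytes:
    """Row key for tbl1: zero-padded user id, then the day as YYYYMMDD."""
    return f"{user_id:0{USER_ID_WIDTH}d}_{day}".encode("ascii")


def email_index_key(email: str) -> bytes:
    """Row key for tbl2: md5 of the email (trimmed and lowercased, an assumption)."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.md5(normalized).hexdigest().encode("ascii")


# Plain dicts stand in for the two tables in this sketch.
tbl1 = {}  # user_row_key -> metrics for that user and day
tbl2 = {}  # email_index_key -> list of tbl1 row keys


def put_user_day(user_id: int, email: str, day: str, metrics: dict) -> None:
    """Write the data row to tbl1 and the matching index entry to tbl2."""
    key = user_row_key(user_id, day)
    tbl1[key] = metrics
    tbl2.setdefault(email_index_key(email), []).append(key)
```

Every write goes to both tables, so a lookup by email is one read of tbl2 followed by reads of tbl1 by row key. Keeping the two tables in step is up to the application in this sketch.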

You may also want to search Google for "design patterns hbase".
There are also some examples in "HBase: The Definitive Guide" (O'Reilly, 2011).
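
For the per-user statistics asked about in the quoted message below, here is a small sketch of the idea: with {user id + day} row keys kept in sorted order, each user's date range is one contiguous slice of keys, and sum, min, max, mean and standard deviation come from a single pass over that slice. A sorted list and a dict stand in for the table, the sample values are invented, and the helper names are hypothetical; a real job would read the same ranges with scans or MapReduce.

```python
import bisect
import math


def row_key(user_id, day):
    """Fixed-width user id plus YYYYMMDD day, so string order is (user, day) order."""
    return f"{user_id:010d}_{day}"


def scan_values(sorted_keys, table, user_id, start_day, end_day, column):
    """Return one column for a user's rows with start_day <= day <= end_day."""
    lo = bisect.bisect_left(sorted_keys, row_key(user_id, start_day))
    hi = bisect.bisect_right(sorted_keys, row_key(user_id, end_day))
    return [table[k][column] for k in sorted_keys[lo:hi]]


def summarize(values):
    """Sum, min, max, mean and population standard deviation, or None if empty."""
    if not values:
        return None
    n = len(values)
    total = sum(values)
    mean = total / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return {"sum": total, "min": min(values), "max": max(values),
            "mean": mean, "std": math.sqrt(variance)}


# Invented sample rows: X1 per user and day.
table = {
    row_key(1, "20120101"): {"X1": 4},
    row_key(1, "20120102"): {"X1": 2},
    row_key(1, "20120103"): {"X1": 30},
    row_key(2, "20120102"): {"X1": 23},
    row_key(2, "20120103"): {"X1": 56},
}
sorted_keys = sorted(table)

# Each user has a separate {start, end} range, as in the question.
ranges = {1: ("20120102", "20120103"), 2: ("20120101", "20120102")}
result = {
    uid: summarize(scan_values(sorted_keys, table, uid, sd, ed, "X1"))
    for uid, (sd, ed) in ranges.items()
}
```

The per-user work is independent, so it can be split across users, which is what makes it a reasonable fit for a MapReduce job over the table.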


On 21.01.12 11:32, Amit Gupta wrote:
> I am not sure how I can do joins using HBase which is essentially what I am
> trying to do. Based on what I have read it looks
> like HBase is really good for scans or row key lookup. Please correct me if
> I am wrong.
>
> I can have an HBase table for users with {userid + timestamp} as the rowkey.
> With that, a lookup for a single user over a given time range will be fast.
> However, I need to do lookups for millions of users, each over a different
> time range. Will that also be fast?
>
> Also, lookups are not the only thing that I am trying to do. I need to
> compute statistics like sum, min, max etc. for each data
> point for a user. How can I do that efficiently using HBase?
>
>
> On Fri, Jan 20, 2012 at 2:20 PM, T Vinod Gupta <tvinod@readypulse.com> wrote:
>
>> from the little i have used hbase for, it is really good for the below use
>> case you mentioned. hbase takes care of scale and you can use map reduce to
>> do the kind of task you mentioned below.
>> but please remember that it is super important how you design the schema.
>> the schema should allow for your use case and allow for an efficient map
>> reduce.
>> if you decide with hbase, read the hbase book before deployment or schema
>> design/implementation.
>> thanks
>>
>> On Fri, Jan 20, 2012 at 2:10 PM, Amit Gupta<dlgamit16@gmail.com>  wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> I am trying to figure out if Hbase is the right candidate for my use case
>>> which is as follows :
>>>
>>>
>>>
>>> I have a users table containing millions of users, and for each user I have a
>>> bunch of data points for each day in the past 2 years. Some of these data
>>> points are the number of clicks in different parts of a web page, total # of
>>> clicks, total searches, # of unique searches etc. So the data is in this form:
>>>
>>>
>>>
>>> User Id | Date   | X1 (Total Clicks) | X2 (Total Searches) | X3 | ….. | Xn
>>> 1       | D1-730 | 4                 | 0.8                 |    |     | 90
>>> 1       | D1-729 | 2                 | 0.5                 |    |     | 50
>>> …       |        |                   |                     |    |     |
>>> 1       | D1     | 30                | 0.9                 |    |     | 20
>>> 2       | D1-730 | 23                | 1.2                 |    |     | 85
>>> 2       | D1-729 | 56                | 2.3                 |    |     | 56
>>> ….      |        |                   |                     |    |     |
>>>
>>>
>>> My application has the following predominant query pattern: for a subset
>>> of users (the subset being quite large, on the order of 1-5 million), I want
>>> to compute the sum, min, max, mean and standard deviation of data points over
>>> a different date range for each user. So, e.g., user1 may have a start and end
>>> date of {sd1, ed1}, user2 may have {sd2, ed2} and so on. I want to compute
>>> sum, min, max etc. for data points X1, X2, … Xn over date ranges {sd1, ed1},
>>> {sd2, ed2}, {sd3, ed3} for each user in the subset.
>>>
>>>
>>>
>>> Currently we do this in a database by creating a table for the subset of
>>> users with their start and end dates and joining it against the users table.
>>> The query, however, is extremely slow and takes hours to execute.
>>>
>>>
>>>
>>> I am trying to figure out the following :
>>>
>>>    1. Can I do the above query efficiently using HBase? (I want to reduce
>>>    the query time. Space is not that big of a concern for me.)
>>>
>>>    2. Can someone please give me alternative solutions if HBase is not the
>>>    right solution for such a use case?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> dlg
>>>

