hbase-user mailing list archives

From Doug Meil <doug.m...@explorysmedical.com>
Subject Re: HBase schema question
Date Sat, 21 Jan 2012 13:48:30 GMT

Hi there-

re:  "Based on what I have read it looks like HBase is really good for
scans or row key lookup."

Yes.  It is also a good MapReduce (MR) source and sink.

re:  "how I can do joins"

You either need to denormalize the data on the way into HBase or do a lookup.
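
As a rough illustration of the lookup option, here is a small in-memory sketch. It is not HBase client code: the sorted Python list only stands in for HBase's sorted row keys, and the names `row_key`, `scan`, and `subset`, the `#` separator, and the field widths are all made up for this example. The idea is that the "join" against the subset table turns into one key-range scan per (user, start day, end day) entry.

```python
from bisect import bisect_left, bisect_right


def row_key(user_id: int, day: int) -> str:
    # Fixed-width, zero-padded fields so that lexicographic key order
    # matches numeric (user, day) order. Widths are assumptions.
    return f"{user_id:010d}#{day:05d}"


# Stand-in for a table: (key, columns) pairs kept sorted by row key.
rows = sorted(
    (row_key(u, d), {"x1": u * 100 + d})
    for u in (1, 2, 3)
    for d in range(1, 6)
)
keys = [k for k, _ in rows]


def scan(user_id: int, start_day: int, end_day: int):
    """Return the rows for one user with start_day <= day <= end_day."""
    lo = bisect_left(keys, row_key(user_id, start_day))
    hi = bisect_right(keys, row_key(user_id, end_day))
    return rows[lo:hi]


# Hypothetical subset table: user id -> (start day, end day).
subset = {1: (2, 4), 3: (1, 2)}

# One range scan per subset entry replaces the relational join.
result = {
    u: [cols["x1"] for _, cols in scan(u, sd, ed)]
    for u, (sd, ed) in subset.items()
}
```

Whether millions of such scans are fast enough on a real cluster is something only a prototype will tell you.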

re:  "Will that also be fast?"


HBase has a block cache for frequently accessed data, and very recent HDFS
improvements (e.g., in CDH3u3) are a big help for random lookups and
HDFS throughput in general.  But for your needs this is like asking "how
long is a piece of string?"  You are going to need to try some prototypes
and see if it works for your particular situation.

re:  "sum, min, max"

As for sum, see MapReduce.  For the rest, search the threads on the
dist-list; there has been a conversation on that recently.
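
To make the MapReduce suggestion a little more concrete, here is a minimal sketch of the aggregation pattern in plain Python. It is not Hadoop or HBase code. The `Stats` class, `map_partition`, and `reduce_partials` are hypothetical names, and the sample values are the X1 numbers from the table in the original question. The point is that sum, min, max, mean, and standard deviation can all be derived from a small accumulator that merges cleanly, so partial results computed per input split can be combined in a reduce step.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Stats:
    """Mergeable accumulator: count, sum, sum of squares, min, max."""
    n: int = 0
    total: float = 0.0
    sq_total: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    def add(self, x: float) -> None:
        self.n += 1
        self.total += x
        self.sq_total += x * x
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    def merge(self, other: "Stats") -> None:
        self.n += other.n
        self.total += other.total
        self.sq_total += other.sq_total
        self.lo = min(self.lo, other.lo)
        self.hi = max(self.hi, other.hi)

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def stddev(self) -> float:
        # Population standard deviation from sums; clamp tiny negative
        # values caused by floating-point rounding.
        return math.sqrt(max(self.sq_total / self.n - self.mean ** 2, 0.0))


def map_partition(records):
    """Map/combine step: partial Stats per user for one input split."""
    partial = defaultdict(Stats)
    for user_id, value in records:
        partial[user_id].add(value)
    return partial


def reduce_partials(partials):
    """Reduce step: merge the per-split partial Stats for each user."""
    merged = defaultdict(Stats)
    for partial in partials:
        for user_id, stats in partial.items():
            merged[user_id].merge(stats)
    return merged


# Two made-up input splits of (user id, X1) records already filtered
# to each user's date range.
partitions = [[(1, 4), (2, 23)], [(1, 2), (1, 30), (2, 56)]]
totals = reduce_partials(map_partition(p) for p in partitions)
```

The sum-of-squares formula is the simplest one to merge but can lose precision on large or tightly clustered values; a Welford-style update is a common alternative if that matters for your data.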


On 1/21/12 2:32 AM, "Amit Gupta" <dlgamit16@gmail.com> wrote:

>I am not sure how I can do joins using HBase, which is essentially what I
>am trying to do. Based on what I have read it looks like HBase is really
>good for scans or row key lookup. Please correct me if I am wrong.
>
>I can have an HBase table for users with {userid + timestamp} as the
>rowkey. Using this, a lookup for a single user for a given time range will
>be fast. However, I need to do lookups for millions of users for different
>time ranges. Will that also be fast?
>
>Also, lookups are not the only thing that I am trying to do. I need to
>compute statistics like sum, min, max etc. for each data point for a user.
>How can I do that efficiently using HBase?
>
>
>On Fri, Jan 20, 2012 at 2:20 PM, T Vinod Gupta <tvinod@readypulse.com> wrote:
>
>> from the little i have used hbase for, it is really good for the use
>> case you mentioned below. hbase takes care of scale and you can use map
>> reduce to do the kind of task you mentioned below.
>> but please remember that it is super important how you design the
>> schema. the schema should allow for your use case and allow for an
>> efficient map reduce.
>> if you decide to go with hbase, read the hbase book before deployment or
>> schema design/implementation.
>> thanks
>>
>> On Fri, Jan 20, 2012 at 2:10 PM, Amit Gupta <dlgamit16@gmail.com> wrote:
>>
>> > Hi,
>> >
>> >
>> >
>> > I am trying to figure out if HBase is the right candidate for my use
>> > case, which is as follows:
>> >
>> > I have a users table containing millions of users, and for each user I
>> > have a bunch of data points for each day in the past 2 years. Some of
>> > these data points are number of clicks in different parts of a web
>> > page, total # of clicks, total searches, # of unique searches etc. So
>> > the data is in this form:
>> >
>> >
>> >
>> > User Id | Date   | X1 (Total Clicks) | X2 (Total Searches) | X3 | ... | Xn
>> > --------+--------+-------------------+---------------------+----+-----+---
>> > 1       | D1-730 | 4                 | 0.8                 |    |     | 90
>> > 1       | D1-729 | 2                 | 0.5                 |    |     | 50
>> > ...     |        |                   |                     |    |     |
>> > 1       | D1     | 30                | 0.9                 |    |     | 20
>> > 2       | D1-730 | 23                | 1.2                 |    |     | 85
>> > 2       | D1-729 | 56                | 2.3                 |    |     | 56
>> > ...     |        |                   |                     |    |     |
>> >
>> > My application has the following predominant query pattern: for a
>> > subset of users (the subset being quite large, on the order of 1-5
>> > million), I want to compute the sum, min, max, mean, and standard
>> > deviation of data points over different date ranges for the users. For
>> > example, user1 may have a start and end date of {sd1, ed1}, user2 may
>> > have {sd2, ed2}, and so on. I want to compute sum, min, max etc. for
>> > data points X1, X2, ... Xn over date ranges {sd1, ed1}, {sd2, ed2},
>> > {sd3, ed3} for each user in the subset.
>> >
>> >
>> >
>> > Currently we do this in a db by creating a table for the subset of
>> > users with their start and end day and joining against the users
>> > table. The query, however, is extremely slow and takes hours to execute.
>> >
>> >
>> >
>> > I am trying to figure out the following:
>> >
>> >   1. Can I do the above query efficiently using HBase? (I want to
>> >   reduce the query time. Space is not that big of a concern for me.)
>> >
>> >   2. Can someone please suggest alternative solutions if HBase is not
>> >   the right solution for such a use case?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > dlg
>> >
>>


