hbase-user mailing list archives

From Socialyantra <tvi...@socialyantra.com>
Subject Re: Web analytics and HBase
Date Mon, 14 Nov 2011 16:23:58 GMT
We use HBase for exactly this kind of workload and it works great. It depends on how you design
your data model in HBase. Based on your use case, define a good row key that has the notion
of a timestamp in it. Your counters can be column qualifiers. And yes, the aggregation needs to
be done in your own code.
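To make the idea concrete, here is a sketch in plain Python (not HBase client code) that simulates what this looks like. HBase keeps rows sorted by key, so a key like "<metric_key>#<yyyymmdd>" turns a date-range query into a simple range scan; the key format, metric names, and column names below are illustrative assumptions, not from this thread.

```python
# Simulated "table": HBase stores rows sorted lexicographically by row key,
# so a sorted dict of row_key -> {qualifier: value} models the layout.
# Row keys here follow a hypothetical "<metric_key>#<yyyymmdd>" scheme.
table = {
    "pageviews#20110901": {"counter1": 10, "counter3": 7},
    "pageviews#20110915": {"counter1": 12, "counter3": 5},
    "pageviews#20111002": {"counter1": 3,  "counter3": 9},  # outside range
    "signups#20110920":   {"counter1": 4,  "counter3": 2},  # other metric
}

def range_sum(metric_key, start_date, stop_date, column):
    """Sum one counter column over [start_date, stop_date) for a metric.

    In HBase this would be a Scan bounded by start/stop row keys; the
    actual summing still happens in client code, as noted above.
    """
    start = f"{metric_key}#{start_date}"
    stop = f"{metric_key}#{stop_date}"
    return sum(cols[column]
               for row_key, cols in sorted(table.items())
               if start <= row_key < stop)

print(range_sum("pageviews", "20110901", "20111001", "counter3"))  # 12
```

Because the timestamp sorts lexicographically inside the key, the scan touches only the rows in the requested window; nothing else is read.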

In fact, we went a step further and even do deep dives using HBase. E.g. you may not only want
the count of all IPs that hit your site on a day, but also the list of IPs itself.
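One way this can work (again a plain-Python sketch, not the HBase API, and the row-key scheme and names are assumptions) is to make each visiting IP a column qualifier in a wide row keyed by site and day; a single row read then yields both the list and the count:

```python
from collections import defaultdict

# Simulated wide rows: row_key -> {qualifier: value}. In HBase each IP
# would be a qualifier (e.g. in a family like "ips") inside one row per
# site per day, keyed by a hypothetical "<site>#<yyyymmdd>" scheme.
table = defaultdict(dict)

def record_hit(site, day, ip):
    # In HBase this is a Put with the IP as the qualifier; writing the
    # same qualifier again just overwrites it, so IPs stay unique.
    table[f"{site}#{day}"][ip] = 1

def daily_ips(site, day):
    # One row read returns every qualifier: the IP list *and* its count.
    row = table.get(f"{site}#{day}", {})
    return sorted(row), len(row)

record_hit("example.com", "20111114", "10.0.0.1")
record_hit("example.com", "20111114", "10.0.0.2")
record_hit("example.com", "20111114", "10.0.0.1")  # repeat visit, no dup

ips, count = daily_ips("example.com", "20111114")
print(count, ips)  # 2 ['10.0.0.1', '10.0.0.2']
```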

Thanks
Vinod

Sent from my iPhone.

On Nov 13, 2011, at 3:39 PM, <victor.hong@nokia.com> wrote:

> 
> On Nov 13, 2011, at 6:11 PM, ext Samuel García Martínez wrote:
> 
>> Hi everyone, I have a question about HBase.
>> 
>> * Background:
>> I'm working on an analytics project and, so far, we are using MySQL as the DBMS
>> and Hadoop for data processing and aggregation. Currently, we collect analytics
>> data over HTTP and push it to Hadoop. Every day (in fact, every night
>> :P) we run Hadoop jobs that summarize the data into one-day series as needed by
>> each report (not relational: one denormalized table per report).
>> 
>> Each report table's structure is something like:
>> * metric_key (text)
>> * timestamp
>> * counter1
>> * counter2
>> * counter3
>> * counter4
>> 
>> Querying this data is very straightforward in SQL systems: grouping by
>> metric_key, filtering by date, and using aggregation functions on the counters
>> to calculate factors and coefficients.
>> 
>> * Problem: as for everyone, the data is getting too big to fit on a single SQL
>> machine and performance is dropping. Currently, we receive about 600k events per day,
>> summarized (some get grouped, some get discarded) to ~350k metrics
>> (metric_key + timestamp pairs).
>> 
>> * Question: reading the book, forums, and mailing list, I can't find any pointers
>> on aggregation over arbitrary time-series slices. So, is there any way
>> to query HBase to get the sum of counter3 between 2011-09-01 and 2011-10-01 for
>> every metric_key?
>> 
> 
>> (I mean something like where date > :date_low AND date < :date_high group by
>> metric_key)
>> 
>> I know that using the timestamp as part of the key allows range scanning the
>> table to fetch the rows. But I have no clue whether there is any way to do sums
>> in HBase. Or, if there is no way, would it be crazy to do these aggregation
>> calculations on top of it, after querying?
>> 
> 
> In short, yes, you can. But you have to perform the grouping and sum in your own code
after scanning the rows between 2011-09-01 and 2011-10-01.
> 
> But if you only need to perform daily sums, you can partition your data by date, then
perform the aggregation using Java MapReduce, Hive, or Pig.
> 
> -- Victor
> 
>> These are web reports, so using Hadoop (Pig/Hive) to render this data is
>> completely ruled out.
>> 
>> -- 
>> Regards,
>> Samuel García.
> 
