hbase-user mailing list archives

From <victor.h...@nokia.com>
Subject Re: Web analytics and HBase
Date Sun, 13 Nov 2011 23:39:51 GMT

On Nov 13, 2011, at 6:11 PM, ext Samuel García Martínez wrote:

> Hi everyone, I have a question about HBase.
> 
> * Background:
> I'm working on an analytics project and, so far, we are using MySQL as the
> DBMS and Hadoop for data processing and aggregation. Currently, we collect
> analytics data over HTTP and push it to Hadoop. Every day (in fact, every
> night :P) we run Hadoop jobs to summarize the data into one-day series as
> needed by every report (not relational, one denormalized table per report).
> 
> Each report table's structure is something like:
> * metric_key (text)
> * timestamp
> * counter1
> * counter2
> * counter3
> * counter4
> 
> Querying this data is very straightforward in SQL systems: grouping by
> metric_key, filtering by date, and using aggregation functions on the
> counters to calculate factors and coefficients.
> 
> * Problem: as with everyone, the data is getting too big to fit on a single
> SQL machine and performance is dropping. Currently we receive about 600k
> events per day, summarized (some get grouped, some get discarded) to ~350k
> metrics (metric_key+timestamp pairs).
> 
> * Question: reading the book, the forum, and the mailing list, I can't find
> any clues about aggregation based on arbitrary time series slices. So, is
> there any way to query HBase to get the counter3 sum between 2011-09-01 and
> 2011-10-01 for every metric_key?
> 

> (I mean something like where date > :date_low AND date < :date_high group by
> metric_key)
> 
> I know that using the timestamp as part of the key allows a range scan of
> the table to fetch the rows. But I have no clue whether there is any way to
> do sums in HBase. Or, if there is no way, is it crazy to do these aggregation
> calculations on top of it, after querying?
> 

In short, yes, you can. But you have to perform the grouping and summing in your own code
after scanning the rows between 2011-09-01 and 2011-10-01.
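
As a rough sketch of that scan-then-aggregate approach (not tied to any particular HBase
client version): the code below simulates the table with an in-memory TreeMap so it runs
without a cluster. The row key layout "<date>|<metric_key>", the class name, and the
helper names are all assumptions made for illustration, and only a single counter
(standing in for counter3) is stored per row. With a real table, the subMap call would be
replaced by a Scan using a start row (inclusive) and a stop row (exclusive), and the loop
would run over the returned results.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class ScanAndSumSketch {

    // Hypothetical row key layout: "<yyyy-MM-dd>|<metric_key>".
    // Putting the date first keeps all rows of a date range contiguous,
    // so a range scan touches only the requested slice.
    static String rowKey(String date, String metricKey) {
        return date + "|" + metricKey;
    }

    // In-memory stand-in for the table: row key -> counter3 value.
    static SortedMap<String, Long> sampleTable() {
        SortedMap<String, Long> table = new TreeMap<>();
        table.put(rowKey("2011-08-31", "home"), 100L);  // before the range
        table.put(rowKey("2011-09-01", "home"), 5L);
        table.put(rowKey("2011-09-15", "home"), 7L);
        table.put(rowKey("2011-09-15", "search"), 3L);
        table.put(rowKey("2011-10-01", "search"), 50L); // at the stop row, excluded
        return table;
    }

    // Range "scan" from fromDate (inclusive) to toDate (exclusive),
    // then group by metric_key and sum on the client side.
    static Map<String, Long> sumByMetric(SortedMap<String, Long> table,
                                         String fromDate, String toDate) {
        Map<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> row : table.subMap(fromDate, toDate).entrySet()) {
            String key = row.getKey();
            // Everything after the first '|' is the metric_key.
            String metricKey = key.substring(key.indexOf('|') + 1);
            sums.merge(metricKey, row.getValue(), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        Map<String, Long> sums = sumByMetric(sampleTable(), "2011-09-01", "2011-10-01");
        System.out.println(sums); // prints {home=12, search=3}
    }
}
```

Because the stop row is exclusive, rows dated 2011-10-01 are left out here; to include that
day, the stop row would be the following date. Note also that every row in the range is
pulled to the client before summing, so the cost grows with the width of the date range.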

But if you only need daily sums, you can partition your data by date, then perform the
aggregation using Java MapReduce, Hive, or Pig.

-- Victor

> These are web reports, so using Hadoop (Pig/Hive) to render this data is
> totally ruled out.
> 
> -- 
> Un saludo,
> Samuel García.

