hbase-user mailing list archives

From inder.p...@gmail.com
Subject Re: Web analytics and HBase
Date Mon, 14 Nov 2011 04:10:48 GMT
A note: storing time series data in HBase can cause hot spots and splits. Have you looked
at OpenTSDB?
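
To illustrate the hot-spot concern: if row keys begin with the timestamp, all new writes sort to the end of the table and land on the same region. One common mitigation is to lead the key with the metric and put the time after it, which is broadly the shape OpenTSDB uses. The sketch below is only an illustration in plain Python; `make_row_key` and its byte layout are hypothetical, not OpenTSDB's actual format or an HBase API.

```python
import struct

def make_row_key(metric_key: str, epoch_seconds: int) -> bytes:
    """Hypothetical layout: metric key, a NUL separator, then a big-endian timestamp."""
    # Big-endian unsigned ints sort bytewise in numeric order, so rows for one
    # metric stay in time order while different metrics spread across the key space.
    return metric_key.encode("utf-8") + b"\x00" + struct.pack(">I", epoch_seconds)

# Sorting the keys mimics how a table ordered by row key would lay them out.
keys = sorted(
    make_row_key(metric, ts)
    for metric in ("page.home", "page.cart")
    for ts in (1320000000, 1320003600)
)
```

The trade-off is that a date-range query then needs one range scan per metric rather than a single scan over the whole date range.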

-----Original Message-----
From: <victor.hong@nokia.com>
Date: Sun, 13 Nov 2011 23:39:51 
To: <user@hbase.apache.org>
Reply-To: user@hbase.apache.org
Subject: Re: Web analytics and HBase

On Nov 13, 2011, at 6:11 PM, ext Samuel García Martínez wrote:

> Hi everyone, I have a question about HBase.
> * Background:
> I'm working on an analytics project and, so far, we are using MySQL as the DBMS
> and Hadoop for data processing and aggregation. Currently, we collect analytics
> data over HTTP and push it to Hadoop. Every day (in fact, every night
> :P) we run Hadoop jobs to summarize the data into one-day series as needed by
> every report (not relational, one denormalized table per report).
> Every report table's structure is something like:
> * metric_key (text)
> * timestamp
> * counter1
> * counter2
> * counter3
> * counter4
> Querying this data is very straightforward in SQL systems: grouping by
> metric_key, filtering by date, and using aggregation functions on the counters
> to calculate factors and coefficients.
> * Problem: as for everyone, the data gets too big to fit on a single SQL machine
> and performance is dropping. Currently, we receive about 600k events per day,
> summarized (some get grouped, some get discarded) to ~350k metrics
> (metric_key+timestamp pairs).
> * Question: reading the book, forums, and mailing list, I can't find any clues
> about aggregation over arbitrary time series slices. So, is there any way
> to query HBase to get the counter3 sum between 2011-09-01 and 2011-10-01 for
> every metric_key?

> (I mean something like where date > :date_low AND date < :date_high group by
> metric_key)
> I know that using the timestamp as part of the key allows a range scan of the
> table to fetch the rows. But I have no clue whether there is any way to do sums
> in HBase. Or, if there is no way, is it crazy to do these aggregation
> calculations on top of it, after querying?

In short, yes, you can. But you have to perform the grouping and sum in your own code after
scanning the rows between 2011-09-01 and 2011-10-01.
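
As a rough sketch of that "scan, then group and sum in your own code" step, the Python below stands in for the client side. The sorted `rows` list is a hypothetical substitute for an HBase table keyed by (date, metric_key), and `sum_counter_by_metric` is an illustrative name, not an HBase API. It assumes the stop date is exclusive, as with an HBase scan's stop row.

```python
from bisect import bisect_left
from collections import defaultdict

# Stand-in for a table sorted by row key, here (date, metric_key).
# Hypothetical data; a real client would iterate over scan results instead.
rows = sorted(
    [
        (("2011-08-31", "home"), {"counter3": 5}),
        (("2011-09-01", "home"), {"counter3": 10}),
        (("2011-09-15", "cart"), {"counter3": 7}),
        (("2011-09-20", "home"), {"counter3": 3}),
        (("2011-10-01", "cart"), {"counter3": 100}),
    ],
    key=lambda row: row[0],
)

def sum_counter_by_metric(rows, start_date, stop_date, counter):
    """Range-scan rows in [start_date, stop_date) and sum one counter per metric."""
    keys = [key for key, _ in rows]
    # A one-element tuple sorts before every key with the same date prefix,
    # so these give the first row on start_date and the first row on stop_date.
    lo = bisect_left(keys, (start_date,))
    hi = bisect_left(keys, (stop_date,))
    totals = defaultdict(int)
    for (_, metric_key), counters in rows[lo:hi]:
        totals[metric_key] += counters.get(counter, 0)
    return dict(totals)

totals = sum_counter_by_metric(rows, "2011-09-01", "2011-10-01", "counter3")
```

With the sample rows above, only the three September rows fall inside the range, so the 2011-08-31 and 2011-10-01 rows do not contribute to the sums.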

But if you only need daily sums, you can partition your data by date and then perform the
aggregation using Java MapReduce, Hive, or Pig.

-- Victor

> These are web reports, so using Hadoop (Pig/Hive) to render this data is
> completely ruled out.
> -- 
> Un saludo,
> Samuel García.
