hbase-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: HBase for ad-hoc aggregate queries
Date Wed, 11 Jan 2012 19:48:32 GMT
IMO you will never get the same flexibility. There are also numerous
differences in the data modelling approach (TTL, the need for uniformly
distributed ids in order to scale query volume, etc.).
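
To illustrate the uniformly-distributed-ids point, here is a minimal
sketch of one common way to get there: prefixing each id with a
hash-derived bucket ("salting") so that sequential ids do not all land
in the same key range. The helper names, the bucket count and the key
layout are assumptions made up for this illustration; nothing here is
prescribed by HBase.

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of salt buckets (hypothetical value)


def salted_row_key(natural_id: str, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the id with a stable, hash-derived bucket number.

    Sequential ids such as event-1, event-2, ... end up scattered
    across the key space instead of clustering in one range.
    """
    digest = hashlib.md5(natural_id.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets
    return f"{bucket:02d}|{natural_id}".encode("utf-8")


def bucket_histogram(ids, num_buckets: int = NUM_BUCKETS):
    """Count how many ids fall into each bucket, to eyeball the spread."""
    counts = [0] * num_buckets
    for natural_id in ids:
        key = salted_row_key(natural_id, num_buckets)
        counts[int(key[:2].decode("utf-8"))] += 1
    return counts
```

The trade-off is that a range scan over the natural ids now has to fan
out over every bucket prefix, which is one of the modelling differences
compared to an RDBMS.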

The most flexibility we have reached so far w.r.t. aggregation
queries is with an OLAPish model (see the link on the HBase wiki under
supported projects: HBase-Lattice).

This is for aggregating really high-qps real-time (RT) fact streams.
The list of current limitations is huge, but it serves our purpose so
far.

The most obvious benefits are that queries are fast (because of
precomputed cuboids in a lattice, similar to the cuboid lattice approach
in ROLAP), that the incremental compilation cycle is short (one can grow
and update the cube just a few minutes after a fact is fed into the
system), and that one can scale compilation horizontally for high-volume
fact feeds. There's a fairly limited query language and a basic set of
aggregate functions (along with some weighted time series aggregates).
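
As a rough picture of what "precomputed cuboids in a lattice" means,
here is a toy in-memory sketch. It is not how HBase-Lattice is
implemented; the dimension names, the Lattice class and the use of
plain dicts in place of HBase tables are all assumptions made for
illustration. Each incoming fact is folded into every cuboid (every
subset of the dimensions), so a group-by query becomes a lookup rather
than a scan over raw facts.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("country", "device", "hour")  # hypothetical dimension names


def cuboids(dimensions):
    """Yield every subset of the dimensions, including the empty apex ()."""
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            yield combo


class Lattice:
    """Toy cuboid lattice; dicts stand in for pre-aggregated tables."""

    def __init__(self, dimensions=DIMENSIONS):
        self.dimensions = tuple(dimensions)
        # cuboid -> {dimension values in cuboid order -> [sum, count]}
        self.cells = {
            cuboid: defaultdict(lambda: [0.0, 0])
            for cuboid in cuboids(self.dimensions)
        }

    def add_fact(self, fact, measure):
        """Incrementally fold one fact into every cuboid."""
        for cuboid, table in self.cells.items():
            cell = table[tuple(fact[d] for d in cuboid)]
            cell[0] += measure
            cell[1] += 1

    def query(self, group_by):
        """Return {group values: (sum, count)} from the matching cuboid."""
        cuboid = tuple(d for d in self.dimensions if d in group_by)
        return {key: (s, n) for key, (s, n) in self.cells[cuboid].items()}
```

For example, after feeding a few facts with add_fact(),
query(["country"]) returns per-country sums and counts, and query([])
returns the grand total. The cost is 2^n cuboids for n dimensions, so a
real system would have to be selective about which cuboids it keeps.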

The most severe limitation right now is the lack of a commonly used
multidimensional query dialect such as MDX, which prevents the use of
widely used UI pivoting exploratory clients such as Excel, JPivot, or
Tableau. So it is either custom UI integration, or custom data source
providers for canned reports with tools like Pentaho and Jasper, or
some RT decisioning framework that doesn't require any UI at all and
can use the Java API. I also plan to enable R to run queries against it
(because I personally don't believe in doing ML or analytics using
Excel).

-d

On Wed, Jan 11, 2012 at 10:59 AM, kfarmer <kfarmer@camstar.com> wrote:
>
> I'm taking a look at moving our datastore from Oracle to HBase, and trying to
> understand how HBase could be used for ad-hoc aggregation queries across our
> data.
>
> My understanding is MapReduce is more of a batch framework, so if we want a
> query to come back to the user's request in a few seconds, that won't work
> because of the overhead of running MR and because the MR jobs write back to
> a new table.  Is that correct?
>
> Instead should we be pre-aggregating data as we load into separate tables,
> and then when a user queries instead just do a scan on these pre-aggregated
> tables?
>
> Thanks.
> --
> View this message in context: http://old.nabble.com/HBase-for-ad-hoc-aggregate-queries-tp33123313p33123313.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
