hbase-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Strategies for aggregating data in a HBase table
Date Wed, 21 Dec 2011 08:14:01 GMT
https://github.com/dlyubimov/HBase-Lattice

On Wed, Dec 21, 2011 at 12:13 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> Thomas,
>
> Sorry for the shameless self-promotion. Can you look at our hbase-lattice
> project? It is incremental OLAP-ish cube compilation with custom
> filtering to optimize for composite key scans. There is a rudimentary
> query language as well.
>
> It has a bunch of standard (and not so standard) aggregates for measure
> data, and the ability to add user-defined aggregates relatively easily
> through the model definition.
>
> It is at a very early stage. But see if it could fit your purpose, and
> maybe even share some perspectives, since I am honestly not an expert on
> dimensional data representation.
>
> (I guess I need to add some query shell so people can try it out more easily.)
>
> On Mon, Nov 28, 2011 at 1:55 AM, Steinmaurer Thomas
> <Thomas.Steinmaurer@scch.at> wrote:
>> Hello,
>>
>>
>>
>> this has already been discussed a bit in the past, but I'm trying to
>> refresh this thread as this is an important design issue in our HBase
>> evaluation.
>>
>>
>>
>> Basically, the result of our evaluation was that we are going to be
>> happy with what Hadoop/HBase offers for managing our measurement/sensor
>> data. One crucial thing, though, e.g. for backend analysis tasks, is
>> that we need access to aggregated data very quickly. The idea is to run
>> a MapReduce job and store the daily aggregates in an RDBMS, which allows
>> us to access aggregated data more easily via different tools (BI
>> frontends etc.). Monthly and yearly aggregates are then handled with
>> RDBMS concepts like Materialized Views and Partitioning.
>>
>>
>>
>> While processing the entire HBase table, e.g. every night, is an option
>> when we go live, it probably isn't an option once the data volume grows
>> over the years. So, what options are there for incrementally
>> aggregating only new data?
>>
>>
>>
>> - Perhaps using versioning (internal timestamp) might be an option?
>>
>> - Perhaps having some kind of HBase (daily) staging table which is
>> truncated after aggregating data is an option?
>>
>> - How could Co-processors help here (at the time of the Go-Live, they
>> might be available in e.g. Cloudera)?
>>
>>
>>
>> etc.
>>
>>
>>
>> Any ideas/comments are appreciated.
>>
>>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
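
For illustration only, here is a minimal sketch of the "versioning (internal timestamp)" idea from the quoted question: each run folds in only the cells written since the previous run's watermark. It is plain Python over in-memory tuples, not HBase or HBase-Lattice code; the function names, the (sensor, day, write_ts, value) cell layout, and the count/sum aggregate are all assumptions made for this example. In HBase itself, the same window could presumably be selected by restricting the scan's time range (e.g. via Scan.setTimeRange).

```python
from collections import defaultdict


def new_daily_store():
    """Hypothetical stand-in for the daily aggregate table (e.g. in an RDBMS)."""
    return defaultdict(lambda: {"count": 0, "sum": 0.0})


def aggregate_increment(cells, daily, watermark, upper):
    """Fold cells written in [watermark, upper) into the daily aggregates.

    cells: iterable of (sensor, day, write_ts, value) tuples, where write_ts
           plays the role of the cell's internal (write-time) timestamp.
    Returns the new watermark to persist for the next run.
    """
    for sensor, day, write_ts, value in cells:
        # Only cells written since the last run are considered "new".
        if watermark <= write_ts < upper:
            agg = daily[(sensor, day)]
            # Count and sum are additive, so late data for an old day
            # can simply be merged into the existing aggregate.
            agg["count"] += 1
            agg["sum"] += value
    return upper
```

Usage would be one call per run, e.g. `wm = aggregate_increment(cells, daily, wm, now)`, storing `wm` somewhere durable between runs. This only works as-is for aggregates that can be merged additively (count, sum, and averages derived from them); updates or deletes of already-aggregated cells would need extra handling.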
