From: Jean-Daniel Cryans
To: user@hbase.apache.org
Subject: Re: Strategies for aggregating data in a HBase table
Date: Thu, 1 Dec 2011 10:03:20 -0800

Or you could just prefix the row keys (a rough sketch is below Sam's
quoted mail). Not sure if this is needed natively, or as a tool on top
of HBase. Hive, for example, could do exactly that for you once Hive
partitions are implemented for HBase.

J-D

On Wed, Nov 30, 2011 at 1:34 PM, Sam Seigal wrote:
> What about "partitioning" at the table level? For example, create 12
> tables for the given year. Design the row keys however you like, let's
> say using SHA/MD5 hashes. Place transactions in the appropriate table
> and then do aggregations based on that table alone (this is assuming
> you won't get transactions with timestamps going back more than a
> month into the past). The idea is to archive the tables for a given
> year and start fresh the next. This is acceptable in my use case. I am
> in the process of trying this out, so I don't have any performance
> numbers or issues yet. Experts can comment.
>
> On a further note, having HBase support this natively, i.e. one more
> level of partitioning above the row key but below a table, could be
> beneficial for use cases like these. Comments?
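
For concreteness, a minimal sketch of the prefixing idea against the
0.92-era Java client; the "transactions" table, the "d" family, and the
monthly "yyyyMM|" bucket are assumptions for illustration, not from the
thread:

    import java.security.MessageDigest;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixedWrite {

      // "201112" + "|" + md5(txnId): rows for one month sort together,
      // while the hash spreads the load within the month's key range.
      static byte[] rowKey(String yyyymm, String txnId) throws Exception {
        byte[] hash =
            MessageDigest.getInstance("MD5").digest(Bytes.toBytes(txnId));
        return Bytes.add(Bytes.toBytes(yyyymm + "|"), hash);
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "transactions"); // hypothetical
        Put put = new Put(rowKey("201112", "txn-0042"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"),
            Bytes.toBytes("9.99"));
        table.put(put);
        table.close();
      }
    }

Compared to Sam's table-per-month layout this keeps everything in one
table: aggregating a month becomes a start/stop-row scan over its
"201112|" prefix rather than a scan of a separate table.
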
>
> On Wed, Nov 30, 2011 at 11:53 AM, Jean-Daniel Cryans wrote:
>> Inline.
>>
>> J-D
>>
>> On Mon, Nov 28, 2011 at 1:55 AM, Steinmaurer Thomas wrote:
>>> Hello,
>>> ...
>>>
>>> While processing the entire HBase table, e.g. every night, is an
>>> option when we go live, it probably isn't an option once data volume
>>> grows over the years. So, what options are there for some kind of
>>> incremental aggregation of only the new data?
>>
>> Yeah, you don't want to go there.
>>
>>> - Perhaps using versioning (the internal timestamp) might be an
>>> option?
>>
>> I guess you could do rollups and ditch the raw data, if you don't
>> need it.
>>
>>> - Perhaps having some kind of daily HBase staging table, truncated
>>> after its data is aggregated, is an option?
>>
>> If you do the aggregations nightly, then you won't have "access to
>> aggregated data very quickly".
>>
>>> - How could coprocessors help here (at the time of the go-live, they
>>> might be available in e.g. Cloudera)?
>>
>> Coprocessors are more of an internal HBase tool, so don't put all
>> your eggs there until you've played with them. What you could do is
>> get the 0.92.0 RC0 tarball and try them out :)
>>
>>> Any ideas/comments are appreciated.
>>
>> Normally data is stored in a way that's not easy to query in a batch
>> or analytics mode, so an ETL step is introduced. You'll probably need
>> to do the same: you could asynchronously stream your data to other
>> HBase tables, or to Hive or Pig, via logs or replication, and then
>> either insert it directly into the format it needs to be in or stage
>> it for later aggregations. If you explore those avenues I'm sure
>> you'll find concepts that are very, very similar to those you listed
>> regarding RDBMS.
>>
>> You could also keep live counts using atomic increments; you'd issue
>> those at write time or asynchronously.
>>
>> Hope this helps,
>>
>> J-D
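
And a minimal sketch of the live-counts idea from the last paragraph,
using HTable.incrementColumnValue; the "rollups" table and the column
names are invented for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LiveCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable rollups = new HTable(conf, "rollups"); // hypothetical table

        // One counter row per day; the increment is atomic on the
        // region server, so concurrent writers never lose an update.
        long total = rollups.incrementColumnValue(
            Bytes.toBytes("2011-12-01"),  // row: the day being aggregated
            Bytes.toBytes("d"),           // column family
            Bytes.toBytes("txn_count"),   // qualifier
            1L);                          // amount to add

        System.out.println("transactions so far today: " + total);
        rollups.close();
      }
    }

Issuing the increment at write time keeps the rollup current; doing it
asynchronously (e.g. off a replication stream, as suggested above)
trades some freshness for lower write latency.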