From: Jean-Daniel Cryans
To: user@hbase.apache.org
Subject: Re: Strategies for aggregating data in a HBase table
Date: Thu, 1 Dec 2011 10:03:20 -0800

Or you could just prefix the row keys (a rough sketch is below Sam's
quoted mail). Not sure if this is needed natively, or as a tool on top
of HBase. Hive, for example, could do exactly that for you once Hive
partitions are implemented for HBase.

J-D

On Wed, Nov 30, 2011 at 1:34 PM, Sam Seigal wrote:
> What about "partitioning" at the table level? For example, create 12
> tables for the given year. Design the row keys however you like, let's
> say using SHA/MD5 hashes. Place transactions in the appropriate table
> and then do aggregations based on that table alone (this is assuming
> you won't get transactions with timestamps going back more than a
> month into the past). The idea is to archive the tables for a given
> year and start fresh the next. This is acceptable in my use case. I am
> in the process of trying this out, so I don't have any performance
> numbers or issues yet. Experts can comment.
>
> On a further note, having HBase support this natively, i.e. one more
> level of partitioning above the row key but below a table, could be
> beneficial for use cases like these. Comments?
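
For concreteness, a minimal sketch of the prefixing idea against the
0.92-era Java client; the "transactions" table, the "d" family, and the
monthly "yyyyMM|" bucket are assumptions for illustration, not from the
thread:

    import java.security.MessageDigest;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixedWrite {

      // "201112" + "|" + md5(txnId): rows for one month sort together,
      // while the hash spreads the load within the month's key range.
      static byte[] rowKey(String yyyymm, String txnId) throws Exception {
        byte[] hash =
            MessageDigest.getInstance("MD5").digest(Bytes.toBytes(txnId));
        return Bytes.add(Bytes.toBytes(yyyymm + "|"), hash);
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "transactions"); // hypothetical
        Put put = new Put(rowKey("201112", "txn-0042"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"),
            Bytes.toBytes("9.99"));
        table.put(put);
        table.close();
      }
    }

Compared to Sam's table-per-month layout this keeps everything in one
table: aggregating a month becomes a start/stop-row scan over its
"201112|" prefix rather than a scan of a separate table.
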
>
> On Wed, Nov 30, 2011 at 11:53 AM, Jean-Daniel Cryans wrote:
>> Inline.
>>
>> J-D
>>
>> On Mon, Nov 28, 2011 at 1:55 AM, Steinmaurer Thomas wrote:
>>> Hello,
>>> ...
>>>
>>> While processing the entire HBase table, e.g. every night, is an
>>> option when we go live, it probably isn't an option once data volume
>>> grows over the years. So, what options are there for some kind of
>>> incremental aggregation of only the new data?
>>
>> Yeah, you don't want to go there.
>>
>>> - Perhaps using versioning (the internal timestamp) might be an
>>> option?
>>
>> I guess you could do rollups and ditch the raw data, if you don't
>> need it.
>>
>>> - Perhaps having some kind of daily HBase staging table, truncated
>>> after its data is aggregated, is an option?
>>
>> If you do the aggregations nightly, then you won't have "access to
>> aggregated data very quickly".
>>
>>> - How could coprocessors help here (at the time of the go-live, they
>>> might be available in e.g. Cloudera)?
>>
>> Coprocessors are more of an internal HBase tool, so don't put all
>> your eggs there until you've played with them. What you could do is
>> get the 0.92.0 RC0 tarball and try them out :)
>>
>>> Any ideas/comments are appreciated.
>>
>> Normally data is stored in a way that's not easy to query in a batch
>> or analytics mode, so an ETL step is introduced. You'll probably need
>> to do the same: you could asynchronously stream your data to other
>> HBase tables, or to Hive or Pig, via logs or replication, and then
>> either insert it directly into the format it needs to be in or stage
>> it for later aggregations. If you explore those avenues I'm sure
>> you'll find concepts that are very, very similar to those you listed
>> regarding RDBMS.
>>
>> You could also keep live counts using atomic increments; you'd issue
>> those at write time or asynchronously.
>>
>> Hope this helps,
>>
>> J-D
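
And a minimal sketch of the live-counts idea from the last paragraph,
using HTable.incrementColumnValue; the "rollups" table and the column
names are invented for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LiveCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable rollups = new HTable(conf, "rollups"); // hypothetical table

        // One counter row per day; the increment is atomic on the
        // region server, so concurrent writers never lose an update.
        long total = rollups.incrementColumnValue(
            Bytes.toBytes("2011-12-01"),  // row: the day being aggregated
            Bytes.toBytes("d"),           // column family
            Bytes.toBytes("txn_count"),   // qualifier
            1L);                          // amount to add

        System.out.println("transactions so far today: " + total);
        rollups.close();
      }
    }

Issuing the increment at write time keeps the rollup current; doing it
asynchronously (e.g. off a replication stream, as suggested above)
trades some freshness for lower write latency.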