To: user@hbase.apache.org
From: Padmanaban
Subject: Hbase Data Model to purge old data.
Date: Thu, 26 Jul 2012 06:34:02 +0000 (UTC)

We have the following use case:

Store telecom CDR data on a per-subscriber basis. The data is time-series based: every record belongs to one subscriber, and records come in round the clock. The expected volume is around 300 million records/day. This data is to be queried 24/7 by an online system, where the filters are subscriber id and date range.

Since the volume of data is huge, we have data retention policies to archive old data on a daily basis. For example, if retention is set to 90 days, every day an offline process deletes data older than 90 days from HBase and archives it on tape.

The current HBase data model design is as follows:

A separate table for every day's data, with the subscriber id as the row key. The reason is that a bulk delete of one day's data within a big table is more expensive than dropping a one-day table.

In this per-day-separate-table model, the load balancer never gets triggered, as the current day's table is always in memory, and daughter regions continuously get assigned to the same region server. This leads to region server hotspots.

Please give feedback on whether the per-day-separate-table model is the best practice for this use case, considering the data life cycle management requirement. If yes, how do we solve the side effect of region server hotspots? If no, please advise an alternative model.

Thanks in advance,
Padmanaban M
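
P.S. To make the per-day-table model concrete, here is a minimal sketch of how a date-range query maps onto daily tables and how the retention check decides which day's table can be dropped. The cdr_YYYYMMDD table naming and all function names are assumptions made up for illustration, not our actual schema, and the sketch does not talk to HBase at all.

```python
from datetime import date, timedelta
from typing import List

TABLE_PREFIX = "cdr_"   # hypothetical naming scheme (assumption)
RETENTION_DAYS = 90     # example retention from the description above


def table_for(day: date) -> str:
    """Name of the per-day table holding that day's records."""
    return f"{TABLE_PREFIX}{day:%Y%m%d}"


def tables_for_range(start: date, end: date) -> List[str]:
    """Tables an online query must read for an inclusive date range.

    Within each table the lookup is by row key (the subscriber id).
    """
    if end < start:
        raise ValueError("end date is before start date")
    span = (end - start).days
    return [table_for(start + timedelta(days=i)) for i in range(span + 1)]


def is_expired(day: date, today: date,
               retention_days: int = RETENTION_DAYS) -> bool:
    """True if that day's table is older than the retention window."""
    return (today - day).days > retention_days
```

With this layout a query for one subscriber over N days touches N tables, while purging a day is a single table drop once is_expired() holds for it and the data has been archived.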