Date: Tue, 14 Feb 2012 13:48:49 +0100
Subject: investigating replacing RDBMS with HBase based solution - spliting daily data inflow?
From: Igor Lautar
To: user@hbase.apache.org

Hi All,

I'm investigating performance and scalability improvements for one of our solutions. I'm currently trying to understand whether HBase (+ MapReduce) could provide the scalability we need.

This is the current situation:
- a daily inflow of about 10 GB of data (20+ million rows)
- a daily job running on top of the daily data
- a monthly job running on top of the monthly data
- random access to small amounts of data going back over longer periods (assume a year)

Now the HBase questions:

1) How would one approach splitting the data across nodes?
Considering the daily MapReduce job that would have to run, would it be best to separate the data on a daily basis? Is this possible with a single table, or would it make sense to have one table per day (or similar)? From what I have read so far, it seems one could implement a custom getSplits() to map only the part of the table containing a given day's data. The monthly job would then reuse the same data as the daily job, but it would have to go through all the days in the month.

2) Random access case
Is this feasible with HBase at all? There could be something like a few million random read requests going back a year in time. Note that a certain amount of latency is not a big issue, as the reads are done for independent operations.

There are plans to support larger amounts of data. My thinking is that the first three points could scale very well horizontally, but what about random reads?

Regards,
Igor
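To make question 1 more concrete, here is a minimal sketch of one possible single-table layout, assuming the row key starts with the day in yyyymmdd form. Under that assumption, a day's or a month's data is one contiguous key range that a scan or a custom getSplits() could target. The helper names (row_key, day_range, month_range), the "|" separator and the record id are hypothetical and only illustrate the key arithmetic; none of this is HBase API.

```python
from datetime import date, timedelta
from typing import Tuple


def row_key(day: date, record_id: str) -> bytes:
    """Hypothetical row key: yyyymmdd day prefix, a separator, then a record id."""
    return f"{day:%Y%m%d}|{record_id}".encode("ascii")


def day_range(day: date) -> Tuple[bytes, bytes]:
    """Half-open [start, stop) key range that covers exactly one day."""
    start = f"{day:%Y%m%d}".encode("ascii")
    stop = f"{day + timedelta(days=1):%Y%m%d}".encode("ascii")
    return start, stop


def month_range(year: int, month: int) -> Tuple[bytes, bytes]:
    """Half-open [start, stop) key range that covers one calendar month."""
    first = date(year, month, 1)
    # First day of the following month, rolling over into the next year after December.
    following = date(year + (1 if month == 12 else 0), month % 12 + 1, 1)
    return (f"{first:%Y%m%d}".encode("ascii"),
            f"{following:%Y%m%d}".encode("ascii"))


if __name__ == "__main__":
    start, stop = day_range(date(2012, 2, 14))
    key = row_key(date(2012, 2, 14), "r1")
    # Keys sort bytewise, so every key of that day falls inside [start, stop).
    print(start, stop, start <= key < stop)
```

With such a layout the monthly job would use the month range over the same rows that the daily jobs already read, so no data has to be copied or reorganized between the two.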