From: Jon Bender
Date: Mon, 6 Feb 2012 08:58:38 -0800
Subject: Re: HBase Read Performance - Multiget vs TableInputFormat Job
To: user@hbase.apache.org

Thanks for the responses!

> What percentage of total data is the 300k new rows?

A constantly shrinking percentage. We may retain upwards of 5 years of data
here, so running against the full table will get very expensive going
forward. I think the second approach sounds best.

> If you have the list of the 300k, this could work. You could write a
> mapreduce job that divided the 300k into maps and in each mapper run a
> client to do multiget (it'll sort the gets by regions for you).

When you say it'll sort the gets by regions for me, does that mean I'll need
to identify the regions before dividing up the maps? Or just deal with the
fact that multiple maps might read from the same regionserver?

--Jon

On Mon, Feb 6, 2012 at 8:21 AM, Stack wrote:

> On Sun, Feb 5, 2012 at 8:56 PM, Jon Bender wrote:
> > The two alternatives I am exploring are
> >
> > 1. Running a TableInputFormat MR job that filters for data added in the
> >    past day (Scan on the internal timestamp range of the cells)
>
> You'll touch all your data when you do this.
>
> What percentage of total data is the 300k new rows?
>
> > 2. Using a batched get (multiGet) with a list of the rows that were
> >    written the previous day, most likely using a number of HBase client
> >    processes to read this data out in parallel.
>
> If you have the list of the 300k, this could work. You could write a
> mapreduce job that divided the 300k into maps and in each mapper run a
> client to do multiget (it'll sort the gets by regions for you).
>
> St.Ack
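
For readers following the multiget option discussed above, here is a minimal sketch in plain Java of the two ideas in play: splitting the list of row keys into batches (one batch per multiget call inside a mapper), and what "sorting the gets by regions" amounts to conceptually. It deliberately does not call the HBase client. The class `MultiGetSketch`, the methods `batches` and `groupByRegion`, the batch size, and the sample region start keys are all hypothetical names and values chosen for illustration. It also compares keys as Java strings, whereas HBase compares raw bytes, so treat it as a picture of the idea under those assumptions and not as how the client is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: hypothetical helper, not part of the HBase API.
final class MultiGetSketch {

    // Split the full list of row keys into fixed-size batches.
    // Each batch would back one multiget call issued from a mapper.
    static List<List<String>> batches(List<String> rowKeys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < rowKeys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, rowKeys.size());
            out.add(new ArrayList<>(rowKeys.subList(i, end)));
        }
        return out;
    }

    // Group row keys by region, where a region is identified by its start key
    // and owns every row key >= that start key and < the next start key.
    // Assumption: the first region's start key is the empty string.
    static Map<String, List<String>> groupByRegion(List<String> rowKeys,
                                                   List<String> regionStartKeys) {
        TreeMap<String, List<String>> groups = new TreeMap<>();
        for (String start : regionStartKeys) {
            groups.put(start, new ArrayList<>());
        }
        for (String key : rowKeys) {
            // Greatest start key that is <= the row key.
            String start = groups.floorKey(key);
            if (start == null) {
                throw new IllegalArgumentException("no region covers key: " + key);
            }
            groups.get(start).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("row-050", "row-150", "row-250", "row-160");
        List<String> regionStarts = Arrays.asList("", "row-100", "row-200");

        System.out.println("batches: " + batches(keys, 3));
        System.out.println("by region: " + groupByRegion(keys, regionStarts));
    }
}
```

Going by Stack's reply, the grouping step is something the client does on its own once it is handed a list of gets, which suggests the mappers can be fed arbitrary slices of the key list without looking up regions first. The cost of that simplicity is the one Jon names: several mappers may end up reading from the same regionserver at once.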