Date: Mon, 21 Apr 2014 11:05:35 -0700
Subject: Re: How to get specified rows and avoid full table scanning?
From: James Taylor
To: "user@hbase.apache.org"

Tao,

Just wanted to give you a couple of relevant pointers to Apache Phoenix for your particular problem:
- Preventing hotspotting by salting your table: http://phoenix.incubator.apache.org/salted.html
- Pig Integration for your map/reduce job: http://phoenix.incubator.apache.org/pig_integration.html

What kind of processing will you be doing in your map-reduce job? FWIW, Phoenix will allow you to run SQL queries directly over your data, so that might be an alternative for some of the processing you need to do.

Thanks,
James

On Mon, Apr 21, 2014 at 9:20 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Hi Tao,
>
> also, if you are thinking about time series, you can take a look at OpenTSDB
> http://opentsdb.net/
>
> JM
>
>
> 2014-04-21 11:56 GMT-04:00 Ted Yu:
>
> > There are several alternatives, one of which is HBaseWD:
> >
> > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >
> > You can also take a look at Phoenix.
> >
> > Cheers
> >
> >
> > On Mon, Apr 21, 2014 at 8:04 AM, Tao Xiao wrote:
> >
> > > I have a big table and rows will be added to this table each day.
> > > I want to run a MapReduce job over this table and select the rows of several days as the job's input data. How can I achieve this?
> > >
> > > If I prefix the rowkey with the date, I can easily select one day's data as the job's input, but this will cause a hot spot problem, because hundreds of millions of rows will be added to this table each day and the data will probably go to a single region server. A secondary index would be good for queries but not for a batch processing job.
> > >
> > > Are there any other ways?
> > >
> > > Are there any other frameworks which can achieve this goal more easily? Shark? Stinger? HSearch?
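
A minimal sketch of the salting idea raised in this thread, assuming a rowkey layout of one salt byte, then the date as yyyymmdd, then an entity id. The layout, the bucket count, and every name below (`NUM_BUCKETS`, `salt_bucket`, `make_rowkey`, `scan_ranges`) are hypothetical and chosen for illustration; this is not how Phoenix or HBaseWD implement salting internally. It only shows why a date-prefixed key can be spread across regions while several days can still be selected with a small, fixed number of range scans instead of a full table scan.

```python
import hashlib
from datetime import date, timedelta

NUM_BUCKETS = 8  # assumed number of salt buckets; would be tuned to the cluster


def salt_bucket(entity_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Derive a stable bucket number from the non-date part of the key."""
    digest = hashlib.md5(entity_id.encode("utf-8")).digest()
    return digest[0] % num_buckets


def make_rowkey(day: date, entity_id: str, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Build <1-byte salt><yyyymmdd><entity_id> (hypothetical layout)."""
    salt = bytes([salt_bucket(entity_id, num_buckets)])
    return salt + day.strftime("%Y%m%d").encode("ascii") + entity_id.encode("utf-8")


def scan_ranges(first_day: date, last_day: date, num_buckets: int = NUM_BUCKETS):
    """Return one [start, stop) key range per bucket covering first_day..last_day inclusive."""
    start = first_day.strftime("%Y%m%d").encode("ascii")
    # The stop key is exclusive, so use the day after the last wanted day.
    stop = (last_day + timedelta(days=1)).strftime("%Y%m%d").encode("ascii")
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(num_buckets)]


def in_range(key: bytes, key_range) -> bool:
    """Check start <= key < stop using byte-wise lexicographic order."""
    start, stop = key_range
    return start <= key < stop
```

Under these assumptions, writes for a single day land in `NUM_BUCKETS` different key ranges rather than one, and a job over several days needs one range scan per bucket. Each range returned by `scan_ranges` would then become one scan in the job's input.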