Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAPQV63XFgORYZF8ivT569bzwAZAVhfzMWR_D8yUKFM629dPq+A@mail.gmail.com>
References: 
 <CACUUc6B5znHSLxG6N+4tTWmf3XY0r7sUBRVpHNhR+qSURCxSaQ@mail.gmail.com>
	<CALte62xLsHr8Mw3orEp=J2KsTFjeYu1o_xrgxeN0veSZnBSouw@mail.gmail.com>
	<CAPQV63XFgORYZF8ivT569bzwAZAVhfzMWR_D8yUKFM629dPq+A@mail.gmail.com>
Date: Mon, 21 Apr 2014 11:05:35 -0700
Message-ID: 
 <CAG_TOPBaCuNHogG1AhZsTafZ1sqsnkOJwQZ=Q07OuG41V7RAag@mail.gmail.com>
Subject: Re: How to get specified rows and avoid full table scanning?
From: James Taylor <jtaylor@salesforce.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=001a11c13f1494ac9b04f791580b

--001a11c13f1494ac9b04f791580b
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Tao,
Just wanted to give you a couple of relevant pointers to Apache Phoenix for
your particular problem:
- Preventing hotspotting by salting your table:
http://phoenix.incubator.apache.org/salted.html
- Pig Integration for your map/reduce job:
http://phoenix.incubator.apache.org/pig_integration.html
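
For illustration, the idea behind salting can be sketched roughly as below.
This is a minimal Python sketch, not Phoenix's actual implementation: the
bucket count, the <salt><day>|<id> key layout, and the names salted_key and
scan_ranges are all assumptions made for the example. Writes for a single
day are spread across the buckets, and a range of days is read back with one
scan range per bucket.

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count, chosen only for illustration


def salted_key(day: str, record_id: str, buckets: int = NUM_BUCKETS) -> bytes:
    """Build a row key of the (hypothetical) form <salt byte><day>|<record id>."""
    logical_key = f"{day}|{record_id}".encode("utf-8")
    # Derive the salt from the key itself so a reader can recompute it later.
    salt = hashlib.md5(logical_key).digest()[0] % buckets
    return bytes([salt]) + logical_key


def scan_ranges(start_day: str, end_day: str, buckets: int = NUM_BUCKETS):
    """Return one (start_row, stop_row) pair per bucket covering
    start_day <= day < end_day. Stop rows are exclusive."""
    return [
        (bytes([b]) + start_day.encode("utf-8"),
         bytes([b]) + end_day.encode("utf-8"))
        for b in range(buckets)
    ]
```

Under these assumptions, a job over several days would take the union of
the per-bucket ranges as its input instead of one contiguous scan.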

What kind of processing will you be doing in your map-reduce job? FWIW,
Phoenix will allow you to run SQL queries directly over your data, so that
might be an alternative for some of the processing you need to do.

Thanks,
James


On Mon, Apr 21, 2014 at 9:20 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Tao,
>
> also, if you are thinking about time series, you can take a look at
> OpenTSDB: http://opentsdb.net/
>
> JM
>
>
> 2014-04-21 11:56 GMT-04:00 Ted Yu <yuzhihong@gmail.com>:
>
> > There are several alternatives.
> > One of them is HBaseWD:
> >
> >
> > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >
> > You can also take a look at Phoenix.
> >
> > Cheers
> >
> >
> > On Mon, Apr 21, 2014 at 8:04 AM, Tao Xiao <xiaotao.cs.nju@gmail.com>
> > wrote:
> >
> > > I have a big table, and rows will be added to it each day. I want to
> > > run a MapReduce job over this table and select the rows of several
> > > days as the job's input data. How can I achieve this?
> > >
> > > If I prefix the rowkey with the date, I can easily select one day's
> > > data as the job's input, but this will cause a hot spot problem,
> > > because hundreds of millions of rows will be added to this table each
> > > day and the data will probably go to a single region server. A
> > > secondary index would be good for queries but not for a batch
> > > processing job.
> > >
> > > Are there any other ways?
> > >
> > > Are there any other frameworks which can achieve this goal more easily?
> > > Shark? Stinger? HSearch?
> > >
> >
>

--001a11c13f1494ac9b04f791580b--