hbase-user mailing list archives

From Tao Xiao <xiaotao.cs....@gmail.com>
Subject Re: How to get specified rows and avoid full table scanning?
Date Wed, 23 Apr 2014 15:09:19 GMT
Hi all,

Thank you all for your replies. After examining HBaseWD, OpenTSDB and
Phoenix, I feel HBaseWD should meet my requirements.

My use case is as follows: tens of millions of rows are appended to a
table, and each row has a date property, say 2014-04-01. I will submit a
MapReduce job whose input is the rows for a few chosen days, so I need
to filter out rows from all other days. If the date is stored as part of
the row key, I hope I can use a scan that specifies a start key and an
end key. At the same time, I need to prevent hot spotting, because
time-series row keys are naturally stored contiguously.

HBaseWD avoids the hot spot problem by decorating the original row key
with a prefix.
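
To make the prefix idea concrete, here is a minimal sketch in plain Java of
one way such a prefix could be derived and removed again. It is not
HBaseWD's actual API: the class name `SaltedKey`, the bucket count of 16,
the hash-based bucket choice and the sample key layout are all assumptions
made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SaltedKey {
    // Hypothetical bucket count; the thread does not specify one.
    public static final int BUCKETS = 16;

    /** Prepends a one-byte bucket derived from the hash of the original key. */
    public static byte[] salt(byte[] originalKey) {
        // floorMod keeps the bucket in [0, BUCKETS) even for negative hashes.
        byte bucket = (byte) Math.floorMod(Arrays.hashCode(originalKey), BUCKETS);
        byte[] salted = new byte[originalKey.length + 1];
        salted[0] = bucket;
        System.arraycopy(originalKey, 0, salted, 1, originalKey.length);
        return salted;
    }

    /** Strips the one-byte prefix to recover the original key. */
    public static byte[] unsalt(byte[] saltedKey) {
        return Arrays.copyOfRange(saltedKey, 1, saltedKey.length);
    }

    public static void main(String[] args) {
        // Assumed key layout: date first, then some row identifier.
        byte[] key = "2014-04-01|row-000123".getBytes(StandardCharsets.UTF_8);
        byte[] salted = salt(key);
        System.out.println("bucket=" + salted[0]
                + " original=" + new String(unsalt(salted), StandardCharsets.UTF_8));
    }
}
```

Because the bucket depends only on the original key, rows written on the
same day spread over up to 16 key ranges instead of landing next to each
other.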

HBaseWD also makes it possible for a MapReduce job to process data of a
specified range (by creating a scan instance and passing it a *startKey* and a
*stopKey*), *but I'm not sure whether this would trigger a full table scan*.
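
As far as I understand the bucket-prefix scheme, a range over the original
keys can be translated into one bounded range per bucket, so the work would
be a fixed number of range scans rather than one scan over the whole table.
The sketch below shows that translation in plain Java. It is not HBaseWD's
API: `BucketRanges`, `Range`, `rangesFor` and the bucket count of 16 are
hypothetical names and values chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BucketRanges {
    // Hypothetical bucket count; must match the one used when writing.
    public static final int BUCKETS = 16;

    /** A start (inclusive) / stop (exclusive) pair of salted row keys. */
    public static final class Range {
        public final byte[] start;
        public final byte[] stop;

        Range(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    private static byte[] withPrefix(byte bucket, byte[] key) {
        byte[] out = new byte[key.length + 1];
        out[0] = bucket;
        System.arraycopy(key, 0, out, 1, key.length);
        return out;
    }

    /** Expands one range over original keys into one range per bucket. */
    public static List<Range> rangesFor(byte[] startKey, byte[] stopKey) {
        List<Range> ranges = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            ranges.add(new Range(withPrefix((byte) b, startKey),
                                 withPrefix((byte) b, stopKey)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Rows dated 2014-04-01 up to, but not including, 2014-04-04.
        byte[] start = "2014-04-01".getBytes(StandardCharsets.UTF_8);
        byte[] stop = "2014-04-04".getBytes(StandardCharsets.UTF_8);
        for (Range r : rangesFor(start, stop)) {
            System.out.println("bucket " + r.start[0] + ": ["
                    + new String(r.start, 1, r.start.length - 1, StandardCharsets.UTF_8)
                    + ", "
                    + new String(r.stop, 1, r.stop.length - 1, StandardCharsets.UTF_8)
                    + ")");
        }
    }
}
```

Each resulting range would back one scan with its own start and stop row, so
only the key ranges for the requested days are read in each bucket.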





2014-04-22 2:05 GMT+08:00 James Taylor <jtaylor@salesforce.com>:

> Tao,
> Just wanted to give you a couple of relevant pointers to Apache Phoenix for
> your particular problem:
> - Preventing hotspotting by salting your table:
> http://phoenix.incubator.apache.org/salted.html
> - Pig Integration for your map/reduce job:
> http://phoenix.incubator.apache.org/pig_integration.html
>
> What kind of processing will you be doing in your map-reduce job? FWIW,
> Phoenix will allow you to run SQL queries directly over your data, so that
> might be an alternative for some of the processing you need to do.
>
> Thanks,
> James
>
>
> On Mon, Apr 21, 2014 at 9:20 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > Hi Tao,
> >
> > also, if you are thinking about time series, you can take a look at OpenTSDB
> > http://opentsdb.net/
> >
> > JM
> >
> >
> > 2014-04-21 11:56 GMT-04:00 Ted Yu <yuzhihong@gmail.com>:
> >
> > > There are several alternatives.
> > > One of them is HBaseWD:
> > >
> > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > >
> > > You can also take a look at Phoenix.
> > >
> > > Cheers
> > >
> > >
> > > On Mon, Apr 21, 2014 at 8:04 AM, Tao Xiao <xiaotao.cs.nju@gmail.com>
> > > wrote:
> > >
> > > > I have a big table and rows will be added to this table each day. I
> > > > want to run a MapReduce job over this table and select the rows of
> > > > several days as the job's input data. How can I achieve this?
> > > >
> > > > If I prefix the rowkey with the date, I can easily select one day's
> > > > data as the job's input, but this will cause a hot spot problem because
> > > > hundreds of millions of rows will be added to this table each day and
> > > > the data will probably go to a single region server. A secondary index
> > > > would be good for queries but not for a batch processing job.
> > > >
> > > > Are there any other ways?
> > > >
> > > > Are there any other frameworks which can achieve this goal more easily?
> > > > Shark? Stinger? HSearch?
> > > >
> > >
> >
>
