hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: How to get specified rows and avoid full table scanning?
Date Wed, 23 Apr 2014 15:55:41 GMT
As you might have read from
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/,
HBaseWD
aims to get good scan performance by reading records from multiple regions
(see Scan section below figure 3).
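Roughly, the trick is that a single logical key range becomes one small, bounded scan per prefix bucket instead of one full table scan. A minimal sketch (hypothetical Python, not HBaseWD's actual API; the bucket count is an assumption):

```python
# Sketch (not HBaseWD's actual API): with a one-byte-prefix scheme, a single
# logical key range turns into NUM_BUCKETS small, bounded scan ranges.

NUM_BUCKETS = 8  # assumed number of distinct one-byte prefixes

def bucketed_scan_ranges(start_key: bytes, stop_key: bytes):
    """For each bucket prefix, return the (startKey, stopKey) pair that
    covers only the requested logical range within that bucket."""
    return [
        (bytes([bucket]) + start_key, bytes([bucket]) + stop_key)
        for bucket in range(NUM_BUCKETS)
    ]

# A scan over one week of date-prefixed keys becomes 8 bounded scans:
ranges = bucketed_scan_ranges(b"2014-04-01", b"2014-04-08")
```

A client would then open one Scan per returned pair (ideally in parallel) and merge the results; only rows inside the requested logical range are ever read.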

BTW, OpenRSDB above was a typo; it should have been OpenTSDB.

Cheers


On Wed, Apr 23, 2014 at 8:09 AM, Tao Xiao <xiaotao.cs.nju@gmail.com> wrote:

> Hi all,
>
> Thank you all for your replies. After examining HBaseWD, OpenRSDB and
> Phoenix, I feel HBaseWD should meet my requirements.
>
> My business is as follows:  Tens of millions of rows are appended to a
> table and each row has a date property, say 2014-04-01. I will submit a
> MapReduce job whose input is several days' worth of rows from that table,
> so I need to filter out rows from days other than those I specify. If the
> date is stored as part of the row key, I hope I can use a scan specifying
> the start and end keys. At the same time, measures should be taken to
> prevent the hot-spot problem, because time-series row keys naturally tend
> to be stored contiguously.
>
> HBaseWD avoids the hot-spot problem by decorating the original row key
> with a prefix.
>
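As a rough illustration of the prefix decoration mentioned above (hypothetical Python, not HBaseWD's actual API; the hash-based bucket choice and bucket count are assumptions):

```python
# Sketch of prefix "decoration": derive a one-byte prefix from a hash of the
# original key, so sequential keys spread across NUM_BUCKETS regions.
import hashlib

NUM_BUCKETS = 8  # assumed bucket count

def decorate(original_key: bytes) -> bytes:
    """Prepend a deterministic one-byte bucket prefix to the original key."""
    bucket = hashlib.md5(original_key).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + original_key

def undecorate(decorated_key: bytes) -> bytes:
    """Strip the one-byte prefix to recover the original key."""
    return decorated_key[1:]
```

Because the prefix is a deterministic function of the key, writes spread across buckets while reads can still reconstruct which decorated keys to ask for.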
> HBaseWD also makes it possible for a MapReduce job to process data of a
> specified range (by creating a scan instance and passing it a *startKey*
> and a *stopKey*), *but I'm not sure whether this would trigger a full
> table scan*.
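On the full-scan worry: a distributed scanner of this style merges the per-bucket result streams back into one stream ordered by the original key, so each bucket scan stays bounded. A minimal sketch (hypothetical Python, not HBaseWD's actual API):

```python
# Sketch: merge per-bucket scan results into one stream sorted by the
# original (undecorated) key, as an HBaseWD-style distributed scanner does.
import heapq

def merged_rows(per_bucket_results):
    """per_bucket_results: list of result lists, one per bucket, each already
    sorted; keys carry a one-byte bucket prefix. Returns all keys ordered by
    the original key (prefix byte stripped)."""
    return list(heapq.merge(*per_bucket_results, key=lambda k: k[1:]))
```

Each input list is what one bounded per-bucket Scan would return, so nothing outside the requested key range is ever read.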
>
> 2014-04-22 2:05 GMT+08:00 James Taylor <jtaylor@salesforce.com>:
>
> > Tao,
> > Just wanted to give you a couple of relevant pointers to Apache Phoenix
> > for your particular problem:
> > - Preventing hotspotting by salting your table:
> > http://phoenix.incubator.apache.org/salted.html
> > - Pig Integration for your map/reduce job:
> > http://phoenix.incubator.apache.org/pig_integration.html
> >
> > What kind of processing will you be doing in your map-reduce job? FWIW,
> > Phoenix will allow you to run SQL queries directly over your data, so
> > that might be an alternative for some of the processing you need to do.
> >
> > Thanks,
> > James
> >
> >
> > On Mon, Apr 21, 2014 at 9:20 AM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > Hi Tao,
> > >
> > > also, if you are thinking about time series, you can take a look at
> > > OpenTSDB: http://opentsdb.net/
> > >
> > > JM
> > >
> > >
> > > 2014-04-21 11:56 GMT-04:00 Ted Yu <yuzhihong@gmail.com>:
> > >
> > > > There're several alternatives.
> > > > One of which is HBaseWD:
> > > >
> > > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > > >
> > > > You can also take a look at Phoenix.
> > > >
> > > > Cheers
> > > >
> > > >
> > > > On Mon, Apr 21, 2014 at 8:04 AM, Tao Xiao <xiaotao.cs.nju@gmail.com>
> > > > wrote:
> > > >
> > > > > I have a big table and rows will be added to this table each day. I
> > > > > want to run a MapReduce job over this table and select rows of
> > > > > several days as the job's input data. How can I achieve this?
> > > > >
> > > > > If I prefix the rowkey with the date, I can easily select one day's
> > > > > data as the job's input, but this will cause the hot-spot problem,
> > > > > because hundreds of millions of rows will be added to this table
> > > > > each day and the data will probably all go to a single region
> > > > > server. A secondary index would be good for queries but not for a
> > > > > batch processing job.
> > > > >
> > > > > Are there any other ways?
> > > > >
> > > > > Are there any other frameworks which can achieve this goal more
> > > > > easily? Shark? Stinger? HSearch?
> > > > >
> > > >
> > >
> >
>
