hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From yonghu <yongyong...@gmail.com>
Subject Re: Poor HBase map-reduce scan performance
Date Wed, 05 Jun 2013 18:14:29 GMT
Dear Sandy,

Thanks for your explanation.

However, what I don't get is your term "client", is this "client" means
MapReduce jobs? If I understand you right, this means Map function will
process the tuples and during this processing time, the regionserver did
nothing?

regards!

Yong


On Wed, Jun 5, 2013 at 6:12 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> bq. the Regionserver and Tasktracker are the same node when you use
> MapReduce to scan the HBase table.
>
> The scan performed by the Tasktracker on that node would very likely access
> data hosted by region server on other node(s). So there would be RPC
> involved.
>
> There is some discussion on providing shadow reads - writes to specific
> region are solely served by one region server but the reads can be served
> by more than one region server. Of course consistency is one aspect that
> must be tackled.
>
> Cheers
>
> On Wed, Jun 5, 2013 at 7:55 AM, yonghu <yongyong313@gmail.com> wrote:
>
> > Can anyone explain why client + rpc + server will decrease the
> performance
> > of scanning? I mean the Regionserver and Tasktracker are the same node
> when
> > you use MapReduce to scan the HBase table. So, in my understanding, there
> > will be no rpc cost.
> >
> > Thanks!
> >
> > Yong
> >
> >
> > On Wed, Jun 5, 2013 at 10:09 AM, Sandy Pratt <prattrs@adobe.com> wrote:
> >
> > > https://issues.apache.org/jira/browse/HBASE-8691
> > >
> > >
> > > On 6/4/13 6:11 PM, "Sandy Pratt" <prattrs@adobe.com> wrote:
> > >
> > > >Haven't had a chance to write a JIRA yet, but I thought I'd pop in
> here
> > > >with an update in the meantime.
> > > >
> > > >I tried a number of different approaches to eliminate latency and
> > > >"bubbles" in the scan pipeline, and eventually arrived at adding a
> > > >streaming scan API to the region server, along with refactoring the
> scan
> > > >interface into an event-drive message receiver interface.  In so
> doing,
> > I
> > > >was able to take scan speed on my cluster from 59,537 records/sec with
> > the
> > > >classic scanner to 222,703 records per second with my new scan API.
> > > >Needless to say, I'm pleased ;)
> > > >
> > > >More details forthcoming when I get a chance.
> > > >
> > > >Thanks,
> > > >Sandy
> > > >
> > > >On 5/23/13 3:47 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
> > > >
> > > >>Thanks for the update, Sandy.
> > > >>
> > > >>If you can open a JIRA and attach your producer / consumer scanner
> > there,
> > > >>that would be great.
> > > >>
> > > >>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <prattrs@adobe.com>
> > wrote:
> > > >>
> > > >>> I wrote myself a Scanner wrapper that uses a producer/consumer
> queue
> > to
> > > >>> keep the client fed with a full buffer as much as possible.  When
> > > >>>scanning
> > > >>> my table with scanner caching at 100 records, I see about a 24%
> > uplift
> > > >>>in
> > > >>> performance (~35k records/sec with the ClientScanner and ~44k
> > > >>>records/sec
> > > >>> with my P/C scanner).  However, when I set scanner caching to
5000,
> > > >>>it's
> > > >>> more of a wash compared to the standard ClientScanner: ~53k
> > records/sec
> > > >>> with the ClientScanner and ~60k records/sec with the P/C scanner.
> > > >>>
> > > >>> I'm not sure what to make of those results.  I think next I'll
shut
> > > >>>down
> > > >>> HBase and read the HFiles directly, to see if there's a drop off
in
> > > >>> performance between reading them directly vs. via the RegionServer.
> > > >>>
> > > >>> I still think that to really solve this there needs to be sliding
> > > >>>window
> > > >>> of records in flight between disk and RS, and between RS and
> client.
> > > >>>I'm
> > > >>> thinking there's probably a single batch of records in flight
> between
> > > >>>RS
> > > >>> and client at the moment.
> > > >>>
> > > >>> Sandy
> > > >>>
> > > >>> On 5/23/13 8:45 AM, "Bryan Keller" <bryanck@gmail.com> wrote:
> > > >>>
> > > >>> >I am considering scanning a snapshot instead of the table.
I
> believe
> > > >>>this
> > > >>> >is what the ExportSnapshot class does. If I could use the
scanning
> > > >>>code
> > > >>> >from ExportSnapshot then I will be able to scan the HDFS files
> > > >>>directly
> > > >>> >and bypass the regionservers. This could potentially give
me a
> huge
> > > >>>boost
> > > >>> >in performance for full table scans. However, it doesn't really
> > > >>>address
> > > >>> >the poor scan performance against a table.
> > > >>>
> > > >>>
> > > >
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message