hbase-user mailing list archives

From Sandy Pratt <prat...@adobe.com>
Subject Re: Poor HBase map-reduce scan performance
Date Wed, 05 Jun 2013 08:09:22 GMT

https://issues.apache.org/jira/browse/HBASE-8691


On 6/4/13 6:11 PM, "Sandy Pratt" <prattrs@adobe.com> wrote:

>Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>with an update in the meantime.
>
>I tried a number of different approaches to eliminate latency and
>"bubbles" in the scan pipeline, and eventually arrived at adding a
>streaming scan API to the region server, along with refactoring the scan
>interface into an event-driven message receiver interface.  In so doing, I
>was able to take scan speed on my cluster from 59,537 records/sec with the
>classic scanner to 222,703 records per second with my new scan API.
>Needless to say, I'm pleased ;)
>
>More details forthcoming when I get a chance.
>
>Thanks,
>Sandy
>
>On 5/23/13 3:47 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
>
>>Thanks for the update, Sandy.
>>
>>If you can open a JIRA and attach your producer / consumer scanner there,
>>that would be great.
>>
>>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <prattrs@adobe.com> wrote:
>>
>>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
>>> keep the client fed with a full buffer as much as possible.  When scanning
>>> my table with scanner caching at 100 records, I see about a 24% uplift in
>>> performance (~35k records/sec with the ClientScanner and ~44k records/sec
>>> with my P/C scanner).  However, when I set scanner caching to 5000, it's
>>> more of a wash compared to the standard ClientScanner: ~53k records/sec
>>> with the ClientScanner and ~60k records/sec with the P/C scanner.
>>>
>>> I'm not sure what to make of those results.  I think next I'll shut down
>>> HBase and read the HFiles directly, to see if there's a drop off in
>>> performance between reading them directly vs. via the RegionServer.
>>>
>>> I still think that to really solve this there needs to be a sliding window
>>> of records in flight between disk and RS, and between RS and client.  I'm
>>> thinking there's probably a single batch of records in flight between RS
>>> and client at the moment.
>>>
>>> Sandy
>>>
>>> On 5/23/13 8:45 AM, "Bryan Keller" <bryanck@gmail.com> wrote:
>>>
>>> >I am considering scanning a snapshot instead of the table. I believe this
>>> >is what the ExportSnapshot class does. If I could use the scanning code
>>> >from ExportSnapshot then I will be able to scan the HDFS files directly
>>> >and bypass the regionservers. This could potentially give me a huge boost
>>> >in performance for full table scans. However, it doesn't really address
>>> >the poor scan performance against a table.
>>>
>>>
>
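
For readers following the thread: the producer/consumer scanner wrapper
described in the quoted 5/23 message was not posted here, so the following is
only a minimal sketch of the general idea, not Sandy's code and not an HBase
API. The class name PrefetchingIterator and its capacity parameter are
hypothetical. A background thread drains a source iterator into a bounded
queue so the consumer usually finds a record already waiting, instead of
blocking on each fetch.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper (not part of HBase): prefetches from a source iterator. */
final class PrefetchingIterator<T> implements Iterator<T> {
    // Sentinel placed on the queue when the producer has finished.
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue;
    private volatile RuntimeException failure; // set if the source throws
    private Object pending;                    // next item, or END, or null if not yet fetched

    /** Source elements must be non-null; capacity bounds the records held in flight. */
    PrefetchingIterator(Iterator<T> source, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    queue.put(source.next()); // blocks while the buffer is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (RuntimeException e) {
                failure = e;
            } finally {
                try {
                    queue.put(END); // always signal completion to the consumer
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "prefetch-producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (pending == null) {
            try {
                pending = queue.take(); // blocks only when the buffer is empty
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for records", e);
            }
        }
        if (pending == END && failure != null) {
            throw new IllegalStateException("source iterator failed", failure);
        }
        return pending != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) pending;
        pending = null;
        return item;
    }
}
```

Since ResultScanner is an Iterable<Result>, something like
new PrefetchingIterator<>(scanner.iterator(), 5000) would be one way to try
this against a table scan. Note that this only overlaps client-side processing
with fetching; it does not by itself put more than one batch in flight between
the RegionServer and the client, which is the limitation raised in the thread.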

