hbase-user mailing list archives

From Sandy Pratt <prat...@adobe.com>
Subject Re: Poor HBase map-reduce scan performance
Date Wed, 05 Jun 2013 01:11:44 GMT
Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
with an update in the meantime.

I tried a number of different approaches to eliminate latency and
"bubbles" in the scan pipeline, and eventually arrived at adding a
streaming scan API to the region server, along with refactoring the scan
interface into an event-driven message receiver interface.  In so doing, I
was able to take scan speed on my cluster from 59,537 records/sec with the
classic scanner to 222,703 records/sec with my new scan API.
Needless to say, I'm pleased ;)

More details forthcoming when I get a chance.

Thanks,
Sandy
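
The patch itself is not shown in this message, so the following is only a
rough sketch of what an event-driven receiver interface for a scan might
look like. Every name in it (StreamingScanSketch, ScanReceiver, streamScan,
CountingReceiver) is hypothetical and is not part of the HBase API; an
in-memory list stands in for the region server. The point is only the shape:
the source pushes rows to a callback instead of the client pulling one batch
at a time.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingScanSketch {

    /** Hypothetical callback interface: the scan pushes rows to the receiver. */
    interface ScanReceiver<R> {
        void onRow(R row);          // called once per row as it arrives
        void onComplete();          // called after the last row
        void onError(Throwable t);  // called if the scan fails
    }

    /**
     * Stand-in for a streaming source. It pushes every row to the receiver
     * without waiting for the consumer to request the next batch.
     */
    static <R> void streamScan(Iterable<R> rows, ScanReceiver<R> receiver) {
        try {
            for (R row : rows) {
                receiver.onRow(row);
            }
            receiver.onComplete();
        } catch (RuntimeException e) {
            receiver.onError(e);
        }
    }

    /** Receiver that only counts rows, as a throughput test might. */
    static final class CountingReceiver<R> implements ScanReceiver<R> {
        long count = 0;
        boolean done = false;
        Throwable error = null;

        @Override public void onRow(R row) { count++; }
        @Override public void onComplete() { done = true; }
        @Override public void onError(Throwable t) { error = t; }
    }

    public static void main(String[] args) {
        // Assumed test data: 1000 synthetic row keys.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add("row-" + i);
        }
        CountingReceiver<String> receiver = new CountingReceiver<>();
        streamScan(rows, receiver);
        System.out.println("rows received: " + receiver.count);
    }
}
```

In a real streaming scan the calls to onRow would be driven by data arriving
from the network rather than by a local loop, which is where the pipeline
"bubbles" would be avoided.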

On 5/23/13 3:47 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:

>Thanks for the update, Sandy.
>
>If you can open a JIRA and attach your producer / consumer scanner there,
>that would be great.
>
>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <prattrs@adobe.com> wrote:
>
>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
>> keep the client fed with a full buffer as much as possible.  When
>> scanning my table with scanner caching at 100 records, I see about a 24%
>> uplift in performance (~35k records/sec with the ClientScanner and ~44k
>> records/sec with my P/C scanner).  However, when I set scanner caching
>> to 5000, it's more of a wash compared to the standard ClientScanner:
>> ~53k records/sec with the ClientScanner and ~60k records/sec with the
>> P/C scanner.
>>
>> I'm not sure what to make of those results.  I think next I'll shut down
>> HBase and read the HFiles directly, to see if there's a drop off in
>> performance between reading them directly vs. via the RegionServer.
>>
>> I still think that to really solve this there needs to be a sliding
>> window of records in flight between disk and RS, and between RS and
>> client.  I'm thinking there's probably only a single batch of records in
>> flight between RS and client at the moment.
>>
>> Sandy
>>
>> On 5/23/13 8:45 AM, "Bryan Keller" <bryanck@gmail.com> wrote:
>>
>> >I am considering scanning a snapshot instead of the table. I believe
>> >this is what the ExportSnapshot class does. If I could use the scanning
>> >code from ExportSnapshot then I would be able to scan the HDFS files
>> >directly and bypass the regionservers. This could potentially give me a
>> >huge boost in performance for full table scans. However, it doesn't
>> >really address the poor scan performance against a table.
>>
>>
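
The producer/consumer scanner wrapper described in the quoted message was
not posted to the thread. As an illustration of the general technique only,
here is a minimal sketch that wraps any Iterator with a background thread
and a bounded queue, so the next rows are fetched while the consumer is
still processing the current ones. The class name PrefetchingIterator and
the capacity parameter are assumptions made for this sketch. With HBase, the
source could be the iterator of a ResultScanner (which is an
Iterable<Result>), but nothing below depends on HBase. The sketch does not
support null elements or early cancellation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PrefetchingIterator<T> implements Iterator<T> {
    // Sentinel placed on the queue when the source is exhausted or has failed.
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue;
    private volatile Throwable failure;  // set by the producer if the source throws
    private Object next;                 // item taken from the queue but not yet returned

    public PrefetchingIterator(final Iterator<T> source, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                try {
                    // put() blocks when the queue is full, which bounds memory use.
                    while (source.hasNext()) {
                        queue.put(source.next());
                    }
                } catch (RuntimeException e) {
                    failure = e;
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "prefetch-producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take();  // blocks until the producer supplies an item
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for rows", e);
            }
        }
        if (next == END) {
            if (failure != null) {
                throw new IllegalStateException("source iterator failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) next;
        next = null;
        return item;
    }

    public static void main(String[] args) {
        // Assumed test data: integers standing in for scan results.
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            rows.add(i);
        }
        Iterator<Integer> it = new PrefetchingIterator<>(rows.iterator(), 100);
        long count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        System.out.println("rows consumed: " + count);
    }
}
```

A wrapper like this only hides latency on the client side of the existing
scanner. It does not keep more than one batch in flight between the region
server and the client, which is the limitation the sliding-window idea
above is aimed at.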

