hbase-issues mailing list archives

From "Sandy Pratt (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8691) High-Throughput Streaming Scan API
Date Wed, 05 Jun 2013 17:50:21 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676166#comment-13676166 ]

Sandy Pratt commented on HBASE-8691:


Perfectly normal questions that I should have addressed in the initial

I used the servlet as an expedient way of adding an API to HBase without
taking the time to fully understand how HRegionServer uses its associated
RPC server.  I do think that a streaming scan API should be added to the
normal HRegionServer interface, but I don't know how to do that yet, and
it didn't seem critical to validating my performance hypothesis.  I also
wanted to make sure that there's no point where we wait for the full
result before starting to return to the client.

I'm not familiar with the work you're referring to about framing of
results, but I did find that it's critical to do as little encoding of the
stream as possible.  For example, I tried one approach where I
deserialized the cell on the server, then re-encapsulated it and sent it
down to the client.  That was apparently too much work in a tight loop,
and my performance wasn't much better than with a normal scan.  Using the
length-encoded byte stream had a huge impact on performance for me.
Obviously there are only so many cycles to spend between getting the result
from the InternalScanner and putting it on the wire before you start
starving the pipe to the client, but I was surprised at just how few there
actually are.  I would have thought there was time to muck around with
protobuf, but no.
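
For readers unfamiliar with the term, here is a minimal sketch of what a
length-encoded byte stream can look like.  It is not taken from the attached
proof of concept; the class and method names are hypothetical, and the
4-byte big-endian length prefix is an assumption made for illustration.  The
point is only that each record goes on the wire as "length, then raw bytes",
with no per-cell deserialization or re-encoding in the loop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical illustration; not part of HBase or of the attachments.
public class LengthPrefixedFraming {

    // Writes one record as a 4-byte big-endian length followed by the raw bytes.
    public static void writeRecord(DataOutputStream out, byte[] record) throws IOException {
        out.writeInt(record.length);
        out.write(record);
    }

    // Reads the next record, or returns null if the stream ends before a
    // complete length header is available.
    public static byte[] readRecord(DataInputStream in) throws IOException {
        int length;
        try {
            length = in.readInt();
        } catch (EOFException endOfStream) {
            return null;
        }
        byte[] record = new byte[length];
        in.readFully(record); // throws EOFException if the body is truncated
        return record;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(wire);

        // Stand-ins for already-serialized cells coming off a scanner.
        writeRecord(out, "row1/cf:a/value1".getBytes("UTF-8"));
        writeRecord(out, "row2/cf:a/value2".getBytes("UTF-8"));
        out.flush();

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire.toByteArray()));
        byte[] record;
        while ((record = readRecord(in)) != null) {
            System.out.println(new String(record, "UTF-8"));
        }
    }
}
```

The writer does a single integer write and a single array copy per record,
which is about as little work as a framing scheme can do on the hot path.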

One thing I left on the table here is pushing the output stream down to
InternalScanner so that it can stream results directly to the client. As
is, it marshals a batch and then puts them on the wire (I tested with scan
caching 5000 and scan batch 5000).  That's potentially inefficient, I
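
The difference between marshalling a batch and streaming each result can be
sketched as follows.  This is an illustration only: the Iterator stands in
for whatever produces results on the server (it is not the InternalScanner
API), the class and method names are hypothetical, and the framing is the
same assumed 4-byte length prefix.  Both methods put identical bytes on the
wire; they differ in how long a record waits in memory before it is written.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration; not part of HBase or of the attachments.
public class BatchVersusStream {

    // Batched: gather up to batchSize records, then write the whole batch.
    public static void sendBatched(Iterator<byte[]> source, DataOutputStream out, int batchSize)
            throws IOException {
        List<byte[]> batch = new ArrayList<>();
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize || !source.hasNext()) {
                for (byte[] record : batch) {
                    out.writeInt(record.length);
                    out.write(record);
                }
                batch.clear();
            }
        }
        out.flush();
    }

    // Streamed: write each record as soon as the source produces it.
    public static void sendStreamed(Iterator<byte[]> source, DataOutputStream out)
            throws IOException {
        while (source.hasNext()) {
            byte[] record = source.next();
            out.writeInt(record.length);
            out.write(record);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> records = Arrays.asList(new byte[] {1}, new byte[] {2, 3}, new byte[] {4, 5, 6});

        ByteArrayOutputStream batched = new ByteArrayOutputStream();
        ByteArrayOutputStream streamed = new ByteArrayOutputStream();
        sendBatched(records.iterator(), new DataOutputStream(batched), 2);
        sendStreamed(records.iterator(), new DataOutputStream(streamed));

        System.out.println("same bytes on the wire: "
                + Arrays.equals(batched.toByteArray(), streamed.toByteArray()));
    }
}
```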


> High-Throughput Streaming Scan API
> ----------------------------------
>                 Key: HBASE-8691
>                 URL: https://issues.apache.org/jira/browse/HBASE-8691
>             Project: HBase
>          Issue Type: Improvement
>          Components: Scanners
>    Affects Versions: 0.95.0
>            Reporter: Sandy Pratt
>              Labels: perfomance, scan
>         Attachments: HRegionServlet.java, README.txt, RecordReceiver.java, ScannerTest.java,
StreamHRegionServer.java, StreamReceiverDirect.java, StreamServletDirect.java
> I've done some work testing various ways to refactor and optimize Scans in HBase,
and have found that performance can be dramatically increased by the addition of a streaming
scan API.  The attached code constitutes a proof of concept that shows performance increases
of almost 4x in some workloads.
> I'd appreciate testing, replication, and comments.  If the approach seems viable, I think
such an API should be built into some future version of HBase.

