Date: Wed, 5 Jun 2013 17:50:21 +0000 (UTC)
From: "Sandy Pratt (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-8691) High-Throughput Streaming Scan API

[ https://issues.apache.org/jira/browse/HBASE-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676166#comment-13676166 ]

Sandy Pratt commented on HBASE-8691:
------------------------------------

Stack,

Perfectly normal questions that I should have addressed in the initial post.

I used the servlet as an expedient way of adding an API to HBase without taking the time to fully understand how HRegionServer uses its associated RPC server. I do think that a streaming scan API should be added to the normal HRegionServer interface, but I don't know how to do that yet, and it didn't seem critical to validating my performance hypothesis.
I also wanted to make sure that there's no point where we wait for the full result before starting to return to the client.

I'm not familiar with the work you're referring to about framing of results, but I did find that it's critical to do as little encoding of the stream as possible. For example, I tried one approach where I deserialized the cell on the server, then re-encapsulated it and sent it down to the client. That was apparently too much work in a tight loop, and my performance wasn't much better than with a normal scan. Using the length-encoded byte stream had a huge impact on performance for me. Obviously there are only so many cycles to spend between getting the result from the InternalScanner and putting it on the wire before you start starving the pipe to the client, but I was surprised at just how few there actually are. I would have thought there was time to muck around with protobuf, but no.

One thing I left on the table here is pushing the output stream down to InternalScanner so that it can stream results directly to the client. As is, it marshals a batch and then puts it on the wire (I tested with scan caching 5000 and scan batch 5000). That's potentially inefficient, I think.

Sandy

> High-Throughput Streaming Scan API
> ----------------------------------
>
>                 Key: HBASE-8691
>                 URL: https://issues.apache.org/jira/browse/HBASE-8691
>             Project: HBase
>          Issue Type: Improvement
>          Components: Scanners
>    Affects Versions: 0.95.0
>            Reporter: Sandy Pratt
>              Labels: perfomance, scan
>         Attachments: HRegionServlet.java, README.txt, RecordReceiver.java, ScannerTest.java, StreamHRegionServer.java, StreamReceiverDirect.java, StreamServletDirect.java
>
>
> I've done some work testing various ways to refactor and optimize Scans in HBase, and have found that performance can be dramatically increased by the addition of a streaming scan API. The attached code constitutes a proof of concept that shows performance increases of almost 4x in some workloads.
> I'd appreciate testing, replication, and comments. If the approach seems viable, I think such an API should be built into some future version of HBase.
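
For readers unfamiliar with the "length-encoded byte stream" mentioned in the comment above, here is a minimal sketch of the general idea: each record goes on the wire as a 4-byte length followed by its raw bytes, with no re-serialization in between. This is not the wire format of the attached proof of concept, which is not reproduced here. The class name LengthPrefixedStream and the methods writeRecord, readRecord, and roundTrip are hypothetical, and an in-memory buffer stands in for the connection to the client.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of length-prefixed framing; not HBase code.
public class LengthPrefixedStream {

    // Write one record as a 4-byte big-endian length followed by the raw bytes.
    static void writeRecord(DataOutputStream out, byte[] record) throws IOException {
        out.writeInt(record.length);
        out.write(record);
    }

    // Read one record, or return null when the stream has ended.
    // In this sketch a truncated length prefix is treated like end of stream.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int length;
        try {
            length = in.readInt();
        } catch (EOFException endOfStream) {
            return null;
        }
        if (length < 0) {
            throw new IOException("negative record length: " + length);
        }
        byte[] record = new byte[length];
        in.readFully(record); // throws EOFException if the payload is cut short
        return record;
    }

    // Encode all records into a buffer, then decode them again.
    static List<byte[]> roundTrip(List<byte[]> records) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(wire);
            for (byte[] record : records) {
                writeRecord(out, record);
            }
            out.flush();

            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(wire.toByteArray()));
            List<byte[]> decoded = new ArrayList<>();
            byte[] record;
            while ((record = readRecord(in)) != null) {
                decoded.add(record);
            }
            return decoded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<byte[]> records = Arrays.asList(
            "row1".getBytes(StandardCharsets.UTF_8),
            new byte[0],
            "row-three".getBytes(StandardCharsets.UTF_8));
        for (byte[] record : roundTrip(records)) {
            System.out.println(record.length + " bytes: "
                + new String(record, StandardCharsets.UTF_8));
        }
    }
}
```

The per-record cost on the sending side is one integer write and one array copy, which is the kind of low-overhead encoding the comment argues for. In a real server the DataOutputStream would wrap the response stream rather than a byte array.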