Date: Wed, 13 May 2015 15:17:03 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-13071) Hbase Streaming Scan Feature

[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-13071:
--------------------------
    Release Note:

MOTIVATION

A pipelined scan API is introduced for speeding up applications that combine massive data traversal with compute-intensive processing. Traditional HBase scans save network trips by prefetching data into the client-side cache. However, they prefetch synchronously: the fetch request to the region server is invoked only when the entire cache is consumed. This leads to a stop-and-wait access pattern, in which the client stalls until the next chunk of data is fetched. Applications that do significant processing can benefit from background data prefetching, which eliminates this bottleneck. The pipelined scan implementation overlaps the cache population at the client side with application processing. Namely, it issues a new scan RPC when the iteration has retrieved 50% of the cache. If the application processing (that is, the time between invocations of next()) is substantial, the new chunk of data will be available before the previous one is exhausted, and the client will not experience any delay. Ideally, the prefetch and the processing times should be balanced.

API AND CONFIGURATION

Asynchronous scanning can be configured either globally for all tables and scans, or on a per-scan basis via a new Scan class API.

h4. Configuration in hbase-site.xml

hbase.client.scanner.async.prefetch, default false:

{code}
<property>
  <name>hbase.client.scanner.async.prefetch</name>
  <value>true</value>
</property>
{code}

h4. API - Scan#setAsyncPrefetch(boolean)

{code}
Scan scan = new Scan();
scan.setCaching(1000);
scan.setMaxResultSize(BIG_SIZE);
scan.setAsyncPrefetch(true);
...
ResultScanner scanner = table.getScanner(scan);
{code}
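For illustration, a minimal end-to-end sketch of the per-scan API against the HBase 1.x client. The table name "mytable" and the process() callback are placeholders, not part of this patch; the commented-out setBoolean line shows the equivalent global switch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

Configuration conf = HBaseConfiguration.create();
// Global alternative to the per-scan API:
// conf.setBoolean("hbase.client.scanner.async.prefetch", true);
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("mytable"))) {
  Scan scan = new Scan();
  scan.setCaching(1000);        // rows transferred per scan RPC
  scan.setAsyncPrefetch(true);  // prefetch the next chunk in the background
  try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result result : scanner) {
      process(result);  // placeholder: compute-intensive per-row work,
    }                   // overlapped with the next prefetch RPC
  }
}
{code}

The heavier the per-row work in process(), the more of the fetch latency the background prefetch can hide.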
IMPLEMENTATION NOTES

Pipelined scan is implemented by a new ClientAsyncPrefetchScanner class, which is fully API-compatible with the synchronous ClientSimpleScanner. ClientAsyncPrefetchScanner is not instantiated for small (Scan#setSmall) or reversed (Scan#setReversed) scans. The application is responsible for setting the prefetch size so that the prefetch time and the processing time are balanced. Note that due to double buffering, the client-side cache can use twice as much memory as the synchronous scanner.
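To make the double-buffering note concrete, here is a conceptual sketch of the prefetch trigger. This is illustrative code only, not the actual ClientAsyncPrefetchScanner source; fetchNextChunk() stands in for the scan RPC:

{code}
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: a cache refilled in the background once iteration
// has consumed half of it, so up to two chunks (double buffering) can
// be resident at the same time.
abstract class PrefetchingCache<R> {
  private final Queue<R> cache = new ConcurrentLinkedQueue<>();
  private final ExecutorService prefetcher = Executors.newSingleThreadExecutor();
  private final int capacity;
  private volatile boolean fetchInFlight = false;

  PrefetchingCache(int capacity) { this.capacity = capacity; }

  R next() {
    // Trigger a background refill when the cache is half empty.
    if (cache.size() <= capacity / 2 && !fetchInFlight) {
      fetchInFlight = true;
      prefetcher.submit(() -> {         // refill overlaps application work
        cache.addAll(fetchNextChunk());
        fetchInFlight = false;
      });
    }
    return cache.poll();  // the real scanner blocks here while the cache is empty
  }

  abstract List<R> fetchNextChunk();  // stand-in for the scan RPC
}
{code}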
> Hbase Streaming Scan Feature
> ----------------------------
>
>                 Key: HBASE-13071
>                 URL: https://issues.apache.org/jira/browse/HBASE-13071
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Eshcar Hillel
>            Assignee: Eshcar Hillel
>         Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, latency.delay.png, latency.png, network.png
>
>
> A scan operation iterates over all rows of a table or a subrange of the table. The synchronous way in which the data is served at the client side hinders the speed at which the application traverses the data: it increases the overall processing time and may cause great variance in the times the application waits for the next piece of data.
> The scanner next() method at the client side invokes an RPC to the region server and then stores the results in a cache. The application can specify how many rows are transmitted per RPC; by default this is set to 100 rows.
> The cache can be considered a producer-consumer queue, where the HBase client pushes the data to the queue and the application consumes it. Currently this queue is synchronous, i.e., blocking. More specifically, when the application has consumed all the data in the cache (so the cache is empty), the HBase client retrieves additional data from the server and refills the cache. During this time the application is blocked.
> Under the assumption that the application processing time can be balanced against the time it takes to retrieve the data, an asynchronous approach can reduce the time the application waits for data.
> We attach a design document.
> We also have a patch that is based on a private branch, and some evaluation results of this code.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)