Date: Wed, 13 May 2015 15:17:03 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-13071) Hbase Streaming Scan Feature

[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-13071:
--------------------------
    Release Note:

MOTIVATION

A pipelined scan API is introduced for speeding up applications that combine massive data traversal with compute-intensive processing. Traditional HBase scans save network trips by prefetching data into the client-side cache. However, they prefetch synchronously: the fetch request to the region server is invoked only when the entire cache is consumed. This leads to a stop-and-wait access pattern, in which the client stalls until the next chunk of data is fetched. Applications that do significant processing can benefit from background data prefetching, which eliminates this bottleneck. The pipelined scan implementation overlaps the cache population at the client side with application processing. Namely, it issues a new scan RPC when the iteration has retrieved 50% of the cache. If the application processing (that is, the time between invocations of next()) is substantial, the new chunk of data will be available before the previous one is exhausted, and the client will not experience any delay. Ideally, the prefetch and the processing times should be balanced.

API AND CONFIGURATION

Asynchronous scanning can be configured either globally for all tables and scans, or on a per-scan basis via a new Scan class API.

h4. Configuration in hbase-site.xml

hbase.client.scanner.async.prefetch, default false:

{code}
<property>
  <name>hbase.client.scanner.async.prefetch</name>
  <value>true</value>
</property>
{code}

h4. API - Scan#setAsyncPrefetch(boolean)

{code}
Scan scan = new Scan();
scan.setCaching(1000);
scan.setMaxResultSize(BIG_SIZE);
scan.setAsyncPrefetch(true);
...
ResultScanner scanner = table.getScanner(scan);
{code}
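For illustration, a minimal end-to-end sketch of the per-scan API against the HBase 1.x client. The table name "mytable" and the process() callback are placeholders, not part of this patch; the commented-out setBoolean line shows the equivalent global switch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

Configuration conf = HBaseConfiguration.create();
// Global alternative to the per-scan API:
// conf.setBoolean("hbase.client.scanner.async.prefetch", true);
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("mytable"))) {
  Scan scan = new Scan();
  scan.setCaching(1000);        // rows transferred per scan RPC
  scan.setAsyncPrefetch(true);  // prefetch the next chunk in the background
  try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result result : scanner) {
      process(result);  // placeholder: compute-intensive per-row work,
    }                   // overlapped with the next prefetch RPC
  }
}
{code}

The heavier the per-row work in process(), the more of the fetch latency the background prefetch can hide.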
IMPLEMENTATION NOTES

Pipelined scan is implemented by a new ClientAsyncPrefetchScanner class, which is fully API-compatible with the synchronous ClientSimpleScanner. ClientAsyncPrefetchScanner is not instantiated for small (Scan#setSmall) or reversed (Scan#setReversed) scans. The application is responsible for setting the prefetch size so that the prefetch time and the processing time are balanced. Note that due to double buffering, the client-side cache can use twice as much memory as the synchronous scanner.
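To make the double-buffering note concrete, here is a conceptual sketch of the prefetch trigger. This is illustrative code only, not the actual ClientAsyncPrefetchScanner source; fetchNextChunk() stands in for the scan RPC:

{code}
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: a cache refilled in the background once iteration
// has consumed half of it, so up to two chunks (double buffering) can
// be resident at the same time.
abstract class PrefetchingCache<R> {
  private final Queue<R> cache = new ConcurrentLinkedQueue<>();
  private final ExecutorService prefetcher = Executors.newSingleThreadExecutor();
  private final int capacity;
  private volatile boolean fetchInFlight = false;

  PrefetchingCache(int capacity) { this.capacity = capacity; }

  R next() {
    // Trigger a background refill when the cache is half empty.
    if (cache.size() <= capacity / 2 && !fetchInFlight) {
      fetchInFlight = true;
      prefetcher.submit(() -> {         // refill overlaps application work
        cache.addAll(fetchNextChunk());
        fetchInFlight = false;
      });
    }
    return cache.poll();  // the real scanner blocks here while the cache is empty
  }

  abstract List<R> fetchNextChunk();  // stand-in for the scan RPC
}
{code}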
> Hbase Streaming Scan Feature
> ----------------------------
>
>                 Key: HBASE-13071
>                 URL: https://issues.apache.org/jira/browse/HBASE-13071
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Eshcar Hillel
>            Assignee: Eshcar Hillel
>         Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, latency.delay.png, latency.png, network.png
>
>
> A scan operation iterates over all rows of a table or a subrange of the table. The synchronous way in which the data is served at the client side hinders the speed at which the application traverses the data: it increases the overall processing time and may cause great variance in the times the application waits for the next piece of data.
> The scanner next() method at the client side invokes an RPC to the region server and then stores the results in a cache. The application can specify how many rows are transmitted per RPC; by default this is set to 100 rows.
> The cache can be considered a producer-consumer queue, where the HBase client pushes the data to the queue and the application consumes it. Currently this queue is synchronous, i.e., blocking. More specifically, when the application has consumed all the data in the cache (so the cache is empty), the HBase client retrieves additional data from the server and refills the cache. During this time the application is blocked.
> Under the assumption that the application processing time can be balanced against the time it takes to retrieve the data, an asynchronous approach can reduce the time the application waits for data.
> We attach a design document.
> We also have a patch that is based on a private branch, and some evaluation results of this code.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)