hbase-issues mailing list archives

From "Jonathan Lawlor (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
Date Tue, 03 Mar 2015 23:35:04 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346012#comment-14346012 ]

Jonathan Lawlor commented on HBASE-13071:
-----------------------------------------

[~eshcar] I have added some review below to follow up on stack's comments:

When defining the capacity for the concurrent queue as below:
{code:title=ClientAsyncPrefetchScanner.java}
...
protected void initCache() {
  // concurrent cache
  // double buffer - double cache size
  cache = new LinkedBlockingQueue<Result>(this.caching*2 + 1);
}
...
{code}
we need to check the size of caching first to make sure that overflow does not occur. For
example, in the case that this.caching > Integer.MAX_VALUE / 2, the expression this.caching * 2 + 1
overflows to a negative value and the LinkedBlockingQueue constructor will throw an
IllegalArgumentException. This is important in the case that the user has configured
Scan#caching=Integer.MAX_VALUE and Scan#maxResultSize to be a nice chunk size (this
configuration is used in instances where the user wants to receive responses of a certain
heap size from the server rather than responses with a certain number of rows).
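A minimal sketch of an overflow-safe capacity calculation (this is a suggestion, not code from the patch; the class and method names are hypothetical):

```java
// Hypothetical helper showing one way to compute caching * 2 + 1 without
// int overflow: widen to long first, then clamp to Integer.MAX_VALUE.
public class CacheCapacity {
  static int safeCapacity(int caching) {
    long doubled = (long) caching * 2 + 1; // widen before doubling
    return (int) Math.min(doubled, Integer.MAX_VALUE);
  }

  public static void main(String[] args) {
    System.out.println(safeCapacity(100));               // 201
    System.out.println(safeCapacity(Integer.MAX_VALUE)); // clamped, stays positive
  }
}
```

With the clamp in place, the LinkedBlockingQueue constructor always receives a positive capacity, even for Scan#caching=Integer.MAX_VALUE.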

When close() is called and the prefetch is running we still need to end up calling super.close()
at some point. In ClientScanner, the call to close() ensures that the RegionScanner is closed
on the server side so it is important that we do not miss this call. 
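One way to guarantee this is a try/finally in close(), so super.close() runs even if stopping the prefetcher fails; this is an illustrative sketch with stand-in names, not the patch's actual implementation:

```java
// Hypothetical sketch: stop the prefetch thread, but always fall through to
// the parent close so the server-side RegionScanner is released.
public class PrefetchScanner {
  private volatile boolean prefetchRunning = true;
  private boolean superClosed = false;

  void stopPrefetch() { prefetchRunning = false; }

  // Stands in for ClientScanner#close(), which closes the RegionScanner.
  void superClose() { superClosed = true; }

  public void close() {
    try {
      stopPrefetch();
    } finally {
      superClose(); // never skipped, even on an exception above
    }
  }

  boolean isClosed() { return superClosed; }
}
```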

The javadoc on the async ClientScanner seems to indicate that the prefetch will be issued
when the cache is half full, but it looks like the cache size check is using caching rather
than caching / 2. My guess is that the first two calls to ClientScanner#next() would both
kick off RPC calls. The first would fetch the initial chunk containing caching number of rows,
and the second call to next() would kick off a prefetch (since one Result was consumed by the
first call, so the cache size will be caching - 1).
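The difference between the two thresholds can be sketched with simple arithmetic (illustrative only; the names are not from the patch):

```java
// With a threshold of `caching`, a prefetch fires as soon as one Result is
// consumed; with `caching / 2`, it waits until the cache is half empty.
public class PrefetchThreshold {
  static boolean shouldPrefetch(int cacheSize, int caching, boolean halfFull) {
    int threshold = halfFull ? caching / 2 : caching;
    return cacheSize <= threshold;
  }

  public static void main(String[] args) {
    int caching = 100;
    // After the first next(): one Result consumed, 99 left in the cache.
    System.out.println(shouldPrefetch(99, caching, false)); // true  -> RPC on 2nd next()
    System.out.println(shouldPrefetch(99, caching, true));  // false -> no RPC yet
  }
}
```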

Some javadoc on the async parameter inside Scan.java may be helpful just to clarify how the
parameter is used. For example, the parameter currently has no effect in the case
that the user has set Scan#setSmall or Scan#setReversed.
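A possible shape for that javadoc, sketched on a stand-in class (the wording and the setSmall/setReversed interaction reflect this review, not verified API text):

```java
// Hypothetical stand-in for Scan to illustrate the suggested javadoc.
public class ScanSketch {
  private boolean small, reversed, async;

  public ScanSketch setSmall(boolean small) { this.small = small; return this; }
  public ScanSketch setReversed(boolean reversed) { this.reversed = reversed; return this; }

  /**
   * Enables asynchronous prefetching of scan results on the client side.
   * Currently has no effect when the scan is small ({@code setSmall}) or
   * reversed ({@code setReversed}).
   */
  public ScanSketch setAsyncPrefetch(boolean async) { this.async = async; return this; }

  // Effective value after accounting for the unsupported cases above.
  boolean isAsyncEffective() { return async && !small && !reversed; }
}
```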

Looks like there may be some minor formatting issues still hanging around in the
latest patch (e.g. tabs should be 2 spaces instead of 4). You may have already seen it, but
in the link [~stack] pointed out, there is mention of a plugin that lets Eclipse formatters
work with IntelliJ; any luck with that? (having the formatter in the IDE avoids headaches :))

Looking forward to getting this one in!

> Hbase Streaming Scan Feature
> ----------------------------
>
>                 Key: HBASE-13071
>                 URL: https://issues.apache.org/jira/browse/HBASE-13071
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 0.98.11
>            Reporter: Eshcar Hillel
>         Attachments: HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf
>
>
> A scan operation iterates over all rows of a table or a subrange of the table. The synchronous nature in which the data is served at the client side hinders the speed the application traverses the data: it increases the overall processing time, and may cause a great variance in the times the application waits for the next piece of data.
> The scanner next() method at the client side invokes an RPC to the regionserver and then stores the results in a cache. The application can specify how many rows will be transmitted per RPC; by default this is set to 100 rows.
> The cache can be considered as a producer-consumer queue, where the hbase client pushes the data to the queue and the application consumes it. Currently this queue is synchronous, i.e., blocking. More specifically, when the application consumed all the data from the cache --- so the cache is empty --- the hbase client retrieves additional data from the server and re-fills the cache with new data. During this time the application is blocked.
> Under the assumption that the application processing time can be balanced by the time it takes to retrieve the data, an asynchronous approach can reduce the time the application is waiting for data.
> We attach a design document.
> We also have a patch that is based on a private branch, and some evaluation results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
