hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-15576) Support stateless scanning and scanning cursor
Date Fri, 01 Apr 2016 15:27:25 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221841#comment-15221841 ]

stack edited comment on HBASE-15576 at 4/1/16 3:27 PM:
-------------------------------------------------------

Hello [~yangzhe1991]

bq. Now for ResultScanner.next(), we may block for longer than the timeout setting
to get a Result if the row is very large, or the filter is sparse, or there are too many delete
markers in the files.

Why do you say the above? With Scanner chunking, don't we return early if we hit thresholds (size,
time)? Re: https://blogs.apache.org/hbase/entry/scan_improvements_in_hbase_1 But perhaps
you are referring to the case where, even with this in place, we can go on longer than a timeout?

What will the Cursor return? The top of the merge sort heap and the mvcc? Yeah, HBASE-13099,
doing it like DynamoDB, relates here (Enis's comment on 1. and 2. in particular). Would the Cursor
be maintained internally? Do you see any need for us to expose it to the client?

bq. Only one rpc like small scanning, not supporting batch/partials, and the cursor is row level.
It is simple to implement.

How would this work against a big row?

On your 'another question', if the Put/Delete come in after your Scan starts, then you will not
see them because of mvcc. If there is a move/split, the mvcc should still apply; if it doesn't, it is
a bug. What happens when a delete comes in after the Scan starts and then
a compaction runs, clearing your Cell -- and the Scan is totally stateless -- is a problem.
I think that even for stateful scanners, if there is a move and then a compaction, the Scan could fail to
read Cells that were present when it started (Having the Scanner 'degrade to row level'
-- i.e. change how it's working mid-scan -- would be erratic behavior, and I'm not sure how you'd
throw an exception for a Cell you don't know is missing).

Resetting the seeks on the server-side is expensive but you know this already I'd say.






> Support stateless scanning and scanning cursor
> ----------------------------------------------
>
>                 Key: HBASE-15576
>                 URL: https://issues.apache.org/jira/browse/HBASE-15576
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Phil Yang
>            Assignee: Phil Yang
>
> Since 1.1.0 was released, we have had the partial and heartbeat protocols in scanning to avoid
returning too much data or timing out. Now for ResultScanner.next(), we may block for longer than
the timeout setting to get a Result if the row is very large, or the filter is sparse, or there are
too many delete markers in the files.
> However, in some scenarios we don't want it to block for too long. For example, consider a
web service that handles requests from mobile devices on unstable networks, where we
cannot set a long timeout (e.g. only 5 seconds) between the mobile device and the web service. This service
scans rows from HBase and returns them to the mobile devices. In this scenario, the simplest approach
is to make the web service stateless. Apps on the mobile devices send several requests one
by one until they have enough data, just like paging through a list. Each request carries
a start position that depends on the last result from the web service. Different requests can
be sent to different web service servers because the service is stateless.
> Therefore, the stateless web service needs a cursor from HBase that tells it how far the
RegionScanner has scanned when the HBase client receives an empty heartbeat. The service returns
the cursor to the mobile device even though the response has no data. The next request can start
at the cursor position; without the cursor we have to scan again from the last returned result and
may time out forever. And of course, even if the heartbeat message is not empty, we can still
use the cursor to avoid re-scanning rows/cells that have already been skipped.
> Obviously, we give up consistency for scanning, because the HBase client is then
stateless too, but that is acceptable in this scenario. And maybe we can keep the mvcc in the cursor so we
can get a consistent view?
> HBASE-13099 had some discussion, but it has made no further progress so far.
> API:
> In Scan we need a new method, setStateless, to make the scan stateless, and another
timeout setting for stateless scanning. In this mode ResultScanner.next() will not block
longer than this timeout setting. next() will return Results as usual, but the last
Result (or the only Result, if we receive an empty heartbeat) carries a special flag that marks it as a cursor.
The cursor Result has only one Cell. Users can scan like this:
> {code}
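> Result r;
> // isCursor() is the flag proposed above; it does not exist yet
> while ((r = scanner.next()) != null && !r.isCursor()) {
>     // process r just like before
> }
> if (r != null) {
>     // scanning has not ended; r is a cursor to resume from
> } else {
>     // scanning has ended
> }
> scanner.close();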
> {code}
> Implementation:
> We have two options for supporting stateless scanning:
> 1. Only one rpc like small scanning, not supporting batch/partials, and the cursor is row level.
It is simple to implement.
> 2. Support big scanning with several rpc requests, supporting batch/partials, and the cursor
is cell level. It is a little more complex because we need to seek on the server side and must make
sure the total time of the rpc requests does not exceed the timeout setting.
> Or we can do it in two phases and support one-shot first?
> Any thoughts? Thanks.
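
For illustration only, the row-level cursor paging described above can be sketched as a small in-memory simulation. Nothing here is HBase API: the class StatelessScanSketch, the Page type, the scanPage method and the row budget are all hypothetical names made up for this sketch. A TreeMap stands in for a region, and a limit on rows examined stands in for the timeout. The point it tries to show is that a page with no matching rows still hands back a cursor, so the next request does not start over from the last returned row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StatelessScanSketch {

    /** One response: the matching row keys plus where to resume (null when the scan is done). */
    static final class Page {
        final List<String> rows;
        final String cursor; // hypothetical: first row key not yet examined, or null at the end

        Page(List<String> rows, String cursor) {
            this.rows = rows;
            this.cursor = cursor;
        }
    }

    /**
     * Examines at most 'budget' rows starting at startRow (inclusive) and keeps
     * those whose value equals 'wanted'. The budget is an assumption that stands
     * in for the per-request timeout; no state survives between calls.
     */
    static Page scanPage(TreeMap<String, String> table, String startRow, String wanted, int budget) {
        List<String> matched = new ArrayList<>();
        int examined = 0;
        for (Map.Entry<String, String> e : table.tailMap(startRow, true).entrySet()) {
            if (examined == budget) {
                // Budget used up: report how far we got, even if nothing matched.
                return new Page(matched, e.getKey());
            }
            examined++;
            if (e.getValue().equals(wanted)) {
                matched.add(e.getKey());
            }
        }
        return new Page(matched, null); // reached the end of the table
    }

    public static void main(String[] args) {
        // A sparse "filter": only one of twenty rows matches.
        TreeMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 20; i++) {
            table.put(String.format("row%02d", i), i == 17 ? "hit" : "miss");
        }

        // Each iteration models one independent request that carries only the cursor.
        String cursor = ""; // the empty string sorts before every row key
        int requests = 0;
        List<String> found = new ArrayList<>();
        while (cursor != null) {
            Page page = scanPage(table, cursor, "hit", 5);
            requests++;
            found.addAll(page.rows);
            cursor = page.cursor;
        }
        System.out.println(found + " after " + requests + " requests");
    }
}
```

With this sample data the first pages come back empty, but each one still advances the cursor, which is the role the empty-heartbeat cursor plays in the proposal. The sketch ignores mvcc, cell-level cursors and partial rows entirely.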



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
