hbase-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (HBASE-11295) Long running scan produces OutOfOrderScannerNextException
Date Wed, 17 Dec 2014 22:39:14 GMT

     [ https://issues.apache.org/jira/browse/HBASE-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell reopened HBASE-11295:

We can get an OutOfOrderScannerNextException if the server thinks it has processed a scanner
‘next’ call but the client does not and retries that ‘next’ RPC. The retry can then
fail the same way, even though technically it's using a new (relocated) scanner.

When the client gets an OutOfOrderScannerNextException, the ClientScanner will retry - once.
We use the boolean control variable {{retryAfterOutOfOrderException}}, set to 'true' initially,
then set to 'false' when looping back to relocate and retry. 
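
To make the retry-once behavior concrete, here is a minimal sketch of the control flow as described above. It is a simplified model, not the ClientScanner source: the class RetryOnceSketch, the nested OutOfOrderException, and the ScanAttempt interface are hypothetical stand-ins, and only the flag name retryAfterOutOfOrderException comes from the real code.

```java
// Simplified model of the retry-once logic described above; NOT the HBase source.
// RetryOnceSketch, OutOfOrderException and ScanAttempt are hypothetical names.
class RetryOnceSketch {
    /** Stand-in for OutOfOrderScannerNextException. */
    static class OutOfOrderException extends RuntimeException {
        OutOfOrderException(String msg) { super(msg); }
    }

    /** One 'next' attempt against a (possibly freshly reopened) scanner. */
    interface ScanAttempt {
        String next();
    }

    static String nextWithSingleRetry(ScanAttempt attempt) {
        boolean retryAfterOutOfOrderException = true; // 'true' initially
        while (true) {
            try {
                return attempt.next();
            } catch (OutOfOrderException e) {
                if (retryAfterOutOfOrderException) {
                    // First failure: clear the flag and loop back.
                    // The real client would relocate and reopen the scanner here.
                    retryAfterOutOfOrderException = false;
                    continue;
                }
                // Second failure: the exception reaches the application.
                throw e;
            }
        }
    }
}
```

In this model, an attempt that fails once and then succeeds returns normally, while an attempt that fails twice in a row throws to the caller no matter why the second failure happened.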

A comment in ScannerCallable#next says: "_If at the server side fetching of next batch of
data was over, there will be mismatch in the nextCallSeq number. Server will throw OutOfOrderScannerNextException
and then client will reopen the scanner with start row as the last successfully retrieved
row._" This is what happens. We set ‘callable’ to null before looping back around, so
nextScanner() will create a new ScannerCallable. The new ScannerCallable does not have an
initialized ‘scannerId’ so it builds a scan open request and sends it to the server. On
the server side, this creates a new RegionScanner with a new identifier. This is like starting
the scan over, except the start row has been updated to the last position of the previous
scanner, so from the application perspective the result stream is seamless. Both the new RegionScanner
and the ScannerCallable on the client restart with nextCallSeq values of 0. 
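
The nextCallSeq bookkeeping can be illustrated with a toy model. This is a sketch under stated assumptions, not the HRegionServer code: NextCallSeqSketch, Server, and OutOfOrderException are hypothetical names, and the error text simply mirrors the wording used in this message rather than the server's exact string.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the per-scanner call sequence check; NOT the HBase source.
class NextCallSeqSketch {
    /** Stand-in for OutOfOrderScannerNextException. */
    static class OutOfOrderException extends RuntimeException {
        OutOfOrderException(String msg) { super(msg); }
    }

    /** Hypothetical server: tracks the expected call sequence per scanner id. */
    static class Server {
        private final Map<Long, Long> expectedSeq = new HashMap<>();
        private long nextScannerId = 1;

        /** A scan open request: new scanner, new identifier, sequence starts at 0. */
        long openScanner() {
            long id = nextScannerId++;
            expectedSeq.put(id, 0L);
            return id;
        }

        /** A 'next' call: rejected unless the client's sequence matches ours. */
        String next(long scannerId, long clientSeq) {
            long expected = expectedSeq.get(scannerId);
            if (clientSeq != expected) {
                throw new OutOfOrderException(
                    "expecting nextCallSeq " + expected + ", got " + clientSeq);
            }
            expectedSeq.put(scannerId, expected + 1);
            return "batch-" + expected;
        }
    }

    /** Server processes a call, the response is lost, the client resends seq 0. */
    static String lostResponseThenRetry(Server server) {
        long id = server.openScanner();
        server.next(id, 0); // server advances to 1; client never sees the response
        try {
            server.next(id, 0); // client still believes its sequence is 0
            return "no error";
        } catch (OutOfOrderException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(lostResponseThenRetry(new Server()));
    }
}
```

Opening a second scanner on the same Server yields a new id whose sequence starts again at 0, which corresponds to the relocation step: the old mismatch is discarded, but the new scanner is exposed to the same failure mode.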

Now with the new scanner we run into bad luck. On this "retry" the request
times out like the first one, again with the server thinking the client should have advanced.
However, inside the ClientScanner state the value of retryAfterOutOfOrderException is ‘false’,
so this time we let the OutOfOrderScannerNextException bubble up to the application:
"expecting nextCallSeq 1, got 0"

I could be missing something. If not, this doesn’t seem quite right. We are using a new
scanner after relocation, like we do for NSREs, that just happens to fail the same way as
the last one, perhaps due to a socket timeout sending the response under similar prevailing
conditions. Why have the special case handling controlled by retryAfterOutOfOrderException?
We retry NSREs up to a configured threshold, then give up. Use the same threshold for OutOfOrderScannerNextExceptions?
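
As a rough illustration of that proposal, the single boolean could be replaced by a counter checked against a threshold. This is only a sketch of the idea, not a patch: BoundedRetrySketch, ScanAttempt, OutOfOrderException, and the maxRetries parameter are hypothetical, and how the threshold would be wired to the existing configuration is left open.

```java
// Sketch of the proposed change: bounded retries instead of a one-shot flag.
// All names here are hypothetical; this is NOT the HBase source.
class BoundedRetrySketch {
    /** Stand-in for OutOfOrderScannerNextException. */
    static class OutOfOrderException extends RuntimeException {
        OutOfOrderException(String msg) { super(msg); }
    }

    /** One 'next' attempt against a (possibly freshly reopened) scanner. */
    interface ScanAttempt {
        String next();
    }

    /**
     * Retries after an out-of-order failure until maxRetries is exhausted,
     * then gives up and rethrows, as described above for NSREs.
     */
    static String nextWithBoundedRetries(ScanAttempt attempt, int maxRetries) {
        int retries = 0;
        while (true) {
            try {
                return attempt.next();
            } catch (OutOfOrderException e) {
                if (retries >= maxRetries) {
                    throw e; // threshold reached: surface the failure
                }
                retries++; // the real client would relocate and reopen here
            }
        }
    }
}
```

With maxRetries set to 1 this behaves like the current retry-once logic. A larger value would tolerate repeated timeouts on relocated scanners before the exception reaches the application.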


> Long running scan produces OutOfOrderScannerNextException
> ---------------------------------------------------------
>                 Key: HBASE-11295
>                 URL: https://issues.apache.org/jira/browse/HBASE-11295
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.96.0
>            Reporter: Jeff Cunningham
>         Attachments: OutOfOrderScannerNextException.tar.gz
> Attached Files:
> HRegionServer.java - instrumented from
> HBaseLeaseTimeoutIT.java - reproducing JUnit 4 test
> WaitFilter.java - Scan filter (extends FilterBase) that overrides filterRowKey() to sleep
> during invocation
> SpliceFilter.proto - Protobuf definition for WaitFilter.java
> OutOfOrderScann_InstramentedServer.log - instrumented server log
> Steps.txt - this note
> Set up:
> In HBaseLeaseTimeoutIT, create a scan, set the given filter (which sleeps in its overridden
> filterRowKey() method) on the scan, and scan the table.
> This is done in test client_0x0_server_150000x10().
> Here's what I'm seeing (see also attached log):
> A new request comes into the server (ID 1940798815214593802 - RpcServer.handler=96) and a
> RegionScanner is created for it, cached by ID, and immediately looked up again, and the cached
> RegionScannerHolder's nextCallSeq is incremented (now at 1).
> The RegionScan thread goes to sleep in WaitFilter#filterRowKey().
> A short (variable) period later, another request comes into the server (ID 8946109289649235722
> - RpcServer.handler=98) and the same series of events happens to this request.
> At this point both RegionScanner threads are sleeping in WaitFilter#filterRowKey(). After
> another period, the client retries another scan request which thinks its next_call_seq is
> 0.  However, HRegionServer's cached RegionScannerHolder thinks the matching RegionScanner's
> nextCallSeq should be 1.

This message was sent by Atlassian JIRA