Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (nike.apache.org: domain of tianq01@gmail.com designates
 209.85.216.45 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAE8tVdkFGwd2BPcT9vrsdiZVy=b0-FDpm60F5zqbVu3yrSgs_w@mail.gmail.com>
References: 
 <CAE8tVdkFGwd2BPcT9vrsdiZVy=b0-FDpm60F5zqbVu3yrSgs_w@mail.gmail.com>
Date: Wed, 15 Oct 2014 17:41:54 +0800
Message-ID: 
 <CAMZUsP56gHt=eaMCf1gN08YNMUdcdTDD9fhY8y4ORvdvs0Z=NA@mail.gmail.com>
Subject: Re: Slow Scan can loop forever
From: Qiang Tian <tianq01@gmail.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=001a11c2b7082f2540050572f16a

--001a11c2b7082f2540050572f16a
Content-Type: text/plain; charset=UTF-8

James,
thanks for the details, an interesting case.

It looks to me the OutOfOrderScannerNextException thrown by RS is correct
-- note the internal cursor in scanner has moved on, even if the client
side retries, client will not get the correct data(will lose data that the
scanner already scanned)
the dead loop is caused by "retryAfterOutOfOrderException  = true;" at the
last line of the try block.
not sure why it is set true there. but looking at below code it looks it
just want to retry once?

            if (e instanceof OutOfOrderScannerNextException) {
              if (retryAfterOutOfOrderException) {
                retryAfterOutOfOrderException = false;
              } else {
                // TODO: Why wrap this in a DNRIOE when it already is a
DNRIOE?
                throw new DoNotRetryIOException("Failed after retry of " +
                  "OutOfOrderScannerNextException: was there a rpc
timeout?", e);
              }
            }


and it looks we throw OutOfOrderScannerNextException here

            } else {
              // If exception is any but the list below throw it back to
the client; else setup
              // the scanner and retry.
              Throwable cause = e.getCause();
              if ((cause != null && cause instanceof
NotServingRegionException) ||
                (cause != null && cause instanceof
RegionServerStoppedException) ||
                e instanceof OutOfOrderScannerNextException) {
                // Pass
                // It is easier writing the if loop test as list of what is
allowed rather than
                // as a list of what is not allowed... so if in here, it
means we do not throw.
              } else {
                throw e;
              }
            }


On Wed, Oct 15, 2014 at 3:58 AM, James Estes <james.estes@gmail.com> wrote:

> I'm having an issue where a Scan gets in a loop, and never completes and
> never times out. I've seen it run for hours, and is repeatable on my
> system. After looking through hbase code and logs, I think I understand
> what is going on.
>
> I'm using hbase 0.96.0-hadoop2, running on Hadoop 2.2.0. I'm using the same
> version of the HBase client.
>
> We do an hourly incremental backup using a simple Java class that sets up a
> Scan with a time range. We're only adding about 250,000 cells ~100MB per
> hour across 46 regions on 8 region servers, so there isn't much, but it'll
> be skipping a lot of data (after a compaction). This Scan will regularly
> get 'stuck' where we'll be alerted that it has been running for hours.
>
> Turns out that one of the scan rpc calls is taking a long time (>
> rpcTimeout which we have set to 10s), and the ClientScanner.next call loops
> forever retrying because for some reason, when the RpcRetryingCaller
> retries after the timeout, it gets an OutOfOrderScannerNextException which
> the ClientScanner treats as needing to reset the scan. So without another
> break condition, the ClientScanner.next keeps resetting the scanner, only
> to get the exact same result.
>
> Here is a rough sequence:
> 1 - the scan gets near the end of a region, and we process those results
> 2 - then it goes back for more via ClientScanner.next(), and there isn't
> any new data at the end of that region, so the rpc call comes back to the
> client (but no new keys are in the result)
> 3 - then it moves the scan forward to the next region, and goes to fetch
> more data. This time though, this next region has recently been compacted
> and it cannot prune store files by the timestamp range, and has a large
> amount of data to scan through (again filtering by timestamp).
> 4 - We have our rpcTimeout set to 10s. This scan rpc call takes longer than
> that b/c of all the data it is churning through.
> 5 - The RetryingRpcHandler gets a timeout exception, and will retry
> 6 - The server almost immediately notices this (usually within 10-20ms). I
> think this indicates that the scan is actively progressing pretty quickly
> through the data because it is able to check and notice the client has gone
> away so quickly...there is just a lot of data to get through.
> 7 - The client retries after 100ms pause time (from the
> RetryingRpcHandler). But the server immediately rejects the scan returning
> an OutOfOrderScannerNextException. The client thinks the call
> next_call_seq=0, server says next_call_seq=1. This maybe shouldn't matter
> much, it just means the client would reset the scanner and we've only had
> one extra rpc call here since the ClientScanner treats
> OutOfOrderScannerNextException as a signal to reset the scanner. But there
> is no other condition to stop the loop via a timeout or retry count.
> 8 - So the client side resets the scanner back to the last key from the
> previous results, and goes to fetch data. And we are back to #1. Right
> here, there seems to be no check to decide if the ClientScanner.next has
> taken too long overall.
>
> Here are some annotated logs:
> https://gist.github.com/housejester/a0a530279eaa868db877
>
> Increasing our rpcTimeout fixes the issue, since we give the Scan enough
> time to complete (it didn't take much, 12s was fine, but we've bumped it up
> to 60s). However, it seems like the ClientScanner.next should either have a
> timeout or max reset count? Maybe rpcTimeout * retries? I think if the
> OutOfOrderScannerNextException didn't occur, then the normal timeout path
> would have worked.
>
> James
>

--001a11c2b7082f2540050572f16a--