accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-3159) BatchScanner very aggressive after Connection Refused
Date Mon, 22 Sep 2014 19:02:34 GMT
Josh Elser created ACCUMULO-3159:
------------------------------------

             Summary: BatchScanner very aggressive after Connection Refused
                 Key: ACCUMULO-3159
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3159
             Project: Accumulo
          Issue Type: Improvement
          Components: client
    Affects Versions: 1.6.0, 1.5.2
            Reporter: Josh Elser
            Priority: Minor


When running the replication tests, I tend to see the Master's log spammed with messages like the following:

{noformat}
[impl.TabletServerBatchReaderIterator] DEBUG: Server : hostname:port msg : java.net.ConnectException: Connection refused
{noformat}

Most of the replication tests restart a tabletserver to trigger log recovery (ultimately to
make sure that a file gets pushed through the replication process). As part of its bookkeeping,
the Master reads the metadata and replication table(s) to figure out whether it needs to
assign any work, clean up old work, etc. It uses a BatchScanner to do this.

What I believe is happening is that the BatchScanner tries to get the TabletClientService client
object for a tabletserver which is dead (the one we killed). This throws a TTransportException,
which we wrap in an IOException and throw up the stack.

{code}
client = ThriftUtil.getTServerClient(server, conf, timeoutTracker.getTimeOut());
{code}

{code}
} catch (TTransportException e) {
  log.debug("Server : " + server + " msg : " + e.getMessage());
  timeoutTracker.errorOccured(e);
  throw new IOException(e);
}
{code}

The caller (the threadpool inside the BatchScanner) catches the IOException, records the failure,
invalidates the cached tablets for the tserver (the one we got the connection refused from),
and repeats: it re-bins the ranges to tablets and re-submits the query task.

When nothing else is going on, this repeats in a very tight loop (every few to tens of
milliseconds). It seems excessive to keep bashing the same tserver that has already given us
a connection refused. Perhaps the catch on TTransportException could be enhanced to introduce
some backoff on connection refused? Alternatively, we could be a little smarter, and less
aggressive, when processing failures.
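
One way to tell a connection refused apart from other transport errors might be to look for a
java.net.ConnectException in the cause chain. This is only a minimal sketch: it assumes the
TTransportException carries the ConnectException as its cause (the log message above suggests
so, but that is not verified here), and the class and method names are made up for illustration,
not existing Accumulo code.

```java
import java.io.IOException;
import java.net.ConnectException;

// Hypothetical helper for illustration only; not part of Accumulo.
class ConnectionRefusedCheck {

  /** Returns true if any exception in the cause chain is a ConnectException. */
  static boolean isConnectionRefused(Throwable t) {
    // Bound the walk so that a cyclic cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 16; depth++, t = t.getCause()) {
      if (t instanceof ConnectException) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Mimics the wrapping described above: a transport error wrapped in an IOException.
    IOException refused = new IOException(new RuntimeException(new ConnectException("Connection refused")));
    IOException other = new IOException("some other failure");
    System.out.println(isConnectionRefused(refused));
    System.out.println(isConnectionRefused(other));
  }
}
```

The failure-processing code could then apply a delay only when this check is true and keep the
current immediate retry for everything else.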

The converse is that, in some cases, we likely want to spin quickly. When a client has stale
tablet information and another tablet server has already picked up the tablets, we want the
client to retry immediately so it can (hopefully) get its results from the new server. Any
change made would definitely need to back off on retries only when the same server is chosen
again.
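
As a rough sketch of what per-server backoff could look like: the first failure against a
server costs no delay (so a client with stale tablet locations can re-bin and retry right
away), and only repeated failures against the same server introduce a growing, capped wait.
The ServerBackoff class, its methods, and the delay values are all hypothetical and do not
exist in Accumulo. The sketch also uses Map.merge, which needs Java 8.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Accumulo.
class ServerBackoff {
  private final long initialMs;
  private final long maxMs;
  // Consecutive connection failures seen per server ("host:port").
  private final Map<String, Integer> failures = new HashMap<>();

  ServerBackoff(long initialMs, long maxMs) {
    this.initialMs = initialMs;
    this.maxMs = maxMs;
  }

  /** Records a failed connection and returns how long to wait before trying this server again. */
  synchronized long recordFailure(String server) {
    int n = failures.merge(server, 1, Integer::sum);
    // First failure: no delay, so the caller can re-bin ranges and retry immediately.
    if (n == 1) {
      return 0L;
    }
    // Repeated failures against the same server: initialMs, 2 * initialMs, ... capped at maxMs.
    // The shift is bounded so typical millisecond values cannot overflow.
    int shift = Math.min(n - 2, 20);
    return Math.min(maxMs, initialMs << shift);
  }

  /** Clears the failure count once the server has answered successfully. */
  synchronized void recordSuccess(String server) {
    failures.remove(server);
  }
}
```

With new ServerBackoff(10, 100), successive failures against the same server would wait 0, 10,
20, 40, 80, then 100 ms, while a failure against a different server still retries immediately.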



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
