accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3276) Shard.xml hung with no client output
Date Thu, 30 Oct 2014 00:09:33 GMT


Josh Elser commented on ACCUMULO-3276:

Made some progress figuring out what exactly happened. The ShardFixture never completed, so
the test hung there. Because the Fixture wasn't running in the executor that I recently added
for nodes, the entire test reported a failure (which is probably desired, since the fixtures
really shouldn't be taking any real length of time). This fixture creates an index table,
adds a random number of splits to it, then creates a data table and adds a different random
number of splits to that table. The following actually happened:

* Master created the index table (id of 'b')
* TabletServer made split points for the index table
* Master acknowledged split points from tserver
* Master created the data table (id of 'c')
* Master assigned default tablet for 'c' to a tabletserver
* Tablet never came online on that tabletserver

Because the tablet never came online, I believe the client's call to addSplits sat indefinitely
waiting for the tablet to be hosted. My guess is that the client hit an indefinite retry loop
with the server throwing a NotServingTabletException, clearing the location cache, then retrying
the same call to the same tserver. I haven't been able to verify this 100% yet.
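
The suspected loop above can be illustrated with a small, self-contained sketch. This is not Accumulo's client code: {{NotServingException}}, {{TabletCall}} and {{retryUntil}} are hypothetical names used only for illustration. It shows one way a retry loop could log each attempt and give up after a deadline instead of retrying silently forever, assuming the caller can tolerate a timeout.

```java
import java.util.concurrent.TimeoutException;

public class BoundedRetry {

  /** Hypothetical stand-in for the server's "tablet not hosted" error. */
  static class NotServingException extends Exception {}

  /** Hypothetical operation that fails while the tablet is offline. */
  interface TabletCall<T> {
    T call() throws NotServingException;
  }

  /**
   * Retries the call until it succeeds or the deadline passes.
   * Each failed attempt is logged so a hang is visible in the client log.
   */
  static <T> T retryUntil(TabletCall<T> op, long timeoutMillis, long sleepMillis)
      throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    int attempts = 0;
    while (true) {
      attempts++;
      try {
        return op.call();
      } catch (NotServingException e) {
        // A real client would invalidate its cached tablet location here.
        if (System.nanoTime() - deadline >= 0) {
          throw new TimeoutException(
              "tablet still not hosted after " + attempts + " attempts");
        }
        System.err.println("attempt " + attempts + ": tablet not hosted, retrying");
        Thread.sleep(sleepMillis);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulates a tablet that never comes online.
    try {
      retryUntil(() -> { throw new NotServingException(); }, 200, 50);
    } catch (TimeoutException e) {
      System.out.println("gave up: " + e.getMessage());
    }
  }
}
```

With an unbounded loop and no per-attempt logging, the same situation would produce no client output at all, which matches what was observed.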

The tabletserver had some assignment requests come to it (7 requests every minute, all at once),
but I don't know which tablets the assignments were for, only that they were from {{!SYSTEM}}.
There was never a {{TABLET_HIST}} message in any of the tserver logs for a tablet from table
{{c}}. I've been unable to determine whether the master never actually retried the assignment
of the tablet for {{c}} or the TabletServer repeatedly ignored/failed the tablet load requests.
I'm tracing through the master assignment code to see if there's anywhere else I can glean
some more information. The TabletServer I expected to be hosting this tablet did not throw
any exceptions, nor did any of the other tablet servers for that matter.
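
For reference, running a fixture under a timeout in an executor could look roughly like the sketch below. It is only an illustration built on {{java.util.concurrent}}: {{FixtureTimeout}} and {{runWithTimeout}} are hypothetical names, and this is not the test framework's actual code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FixtureTimeout {

  /**
   * Runs the fixture on a worker thread.
   * Returns true if it finished in time, false if it had to be cancelled.
   */
  static boolean runWithTimeout(Runnable fixture, long timeoutMillis)
      throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<?> result = executor.submit(fixture);
      try {
        result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        // Interrupt the hung fixture so the worker thread can exit.
        result.cancel(true);
        return false;
      }
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulates a fixture that blocks far longer than the allowed time.
    Runnable hung = () -> {
      try {
        Thread.sleep(60_000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    System.out.println("fast fixture finished: " + runWithTimeout(() -> {}, 1000));
    System.out.println("hung fixture finished: " + runWithTimeout(hung, 100));
  }
}
```

Cancelling only helps if the fixture responds to interruption; a fixture blocked in a call that ignores interrupts would still keep its worker thread alive.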

> Shard.xml hung with no client output
> ------------------------------------
>                 Key: ACCUMULO-3276
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.6.1
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.6.2, 1.7.0
> Ran Shard.xml over a 5 node instance. The only line of client output I got was that ZooSession connected to the quorum.
> 45 minutes later, my test runner timed out the module. We need more information in the client test log to actually determine where it got stuck.

This message was sent by Atlassian JIRA