accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3276) Shard.xml hung with no client output
Date Thu, 30 Oct 2014 00:09:33 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189349#comment-14189349 ]

Josh Elser commented on ACCUMULO-3276:
--------------------------------------

Made some progress figuring out what exactly happened. The ShardFixture never completed, so
the test hung there. Because the fixture wasn't running in the executor that I recently added
for nodes, the entire test reported a failure (which is probably what I want, since fixtures
really shouldn't take any significant length of time). This fixture creates an index table,
adds a random number of splits to it, then creates a data table and adds a different random
number of splits to that table. The following actually happened:

* Master created the index table (id of 'b')
* TabletServer made split points for the index table
* Master acknowledged split points from tserver
* Master created the data table (id of 'c')
* Master assigned default tablet for 'c' to a tabletserver
* Tablet never came online on that tabletserver
* END
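
As an aside on the executor point above, here is a minimal sketch of running a fixture's
setup on an executor with a timeout, so that a hang like this one shows up as a prompt failure.
It uses only java.util.concurrent. The names FixtureTimeoutSketch and runWithTimeout are
hypothetical and are not taken from the randomwalk code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Accumulo.
final class FixtureTimeoutSketch {

  /** Runs setup on a worker thread; returns true if it finished within the timeout. */
  static boolean runWithTimeout(Callable<Void> setup, long timeout, TimeUnit unit)
      throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<Void> result = executor.submit(setup);
      try {
        result.get(timeout, unit); // rethrows setup failures as ExecutionException
        return true;
      } catch (TimeoutException e) {
        result.cancel(true); // interrupts the worker; only helps if setup is interruptible
        return false;
      }
    } finally {
      executor.shutdownNow();
    }
  }
}
```

A caller could treat a false return as a test failure and log which fixture timed out.
Note that cancel(true) only delivers an interrupt, so a setup blocked in code that ignores
interrupts would keep its thread busy even after the caller has moved on.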

Because the tablet never came online, I believe the client's call to addSplits sat indefinitely
waiting for the tablet to be hosted. My guess is that the client hit an indefinite retry loop:
the server throws a NotServingTabletException, the client clears its location cache, then retries
the same call to the same tserver. I haven't been able to verify this 100% yet.
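
To make the suspected loop concrete, here is a rough sketch of that retry pattern with an
attempt limit and per-attempt logging added, so a stuck call would fail loudly instead of
spinning silently. This is not the Accumulo client code. BoundedRetrySketch,
TabletNotHostedException, LocationCache and callWithRetries are all hypothetical stand-ins.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration; none of these types exist in Accumulo.
final class BoundedRetrySketch {

  /** Stand-in for a "tablet not hosted here" error from the server. */
  static final class TabletNotHostedException extends Exception {
    TabletNotHostedException(String message) {
      super(message);
    }
  }

  /** Stand-in hook for dropping the cached tablet location. */
  interface LocationCache {
    void invalidate();
  }

  /** Retries call up to maxAttempts times, invalidating the cache after each failure. */
  static <T> T callWithRetries(Callable<T> call, LocationCache cache, int maxAttempts,
      long sleepMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    TabletNotHostedException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (TabletNotHostedException e) {
        last = e;
        cache.invalidate(); // the next attempt looks the location up again
        System.err.println("attempt " + attempt + " of " + maxAttempts + " failed: "
            + e.getMessage());
        Thread.sleep(sleepMillis);
      }
    }
    // Without this limit, a tablet that never comes online means looping forever.
    throw new IllegalStateException("tablet still not hosted after " + maxAttempts
        + " attempts", last);
  }
}
```

If the real client behaves as guessed, the fresh lookup would keep returning the same tserver
because the assignment never changed, which is why clearing the cache alone would not break
the loop.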

The tabletserver received some assignment requests (7 requests every minute, all at once),
but I don't know which tablets the assignments were for, only that they were from {{!SYSTEM}}.
There was never a {{TABLET_HIST}} message in any of the tserver logs for a tablet from table
{{c}}. I've been unable to determine whether the master never actually retried the assignment
of the tablet for {{c}} or the TabletServer repeatedly ignored/failed the tablet load requests.
I'm tracing through the master assignment code to see if there's anywhere else I can glean
some more information. The TabletServer I expected to be hosting this tablet did not throw
any exceptions, nor did any of the other tablet servers for that matter.

> Shard.xml hung with no client output
> ------------------------------------
>
>                 Key: ACCUMULO-3276
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3276
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.6.1
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.6.2, 1.7.0
>
>
> Ran Shard.xml over a 5 node instance. The only line of client output I got was that ZooSession
> connected to the quorum.
> 45 minutes later, my test runner timed out the module. We need more information in the
> client test log to actually determine where it got stuck.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
