accumulo-notifications mailing list archives

From "Michael Wall (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-2112) master does not balance after intermittent communication failure
Date Mon, 30 Dec 2013 22:25:50 GMT


Michael Wall commented on ACCUMULO-2112:

This issue showed up in the master logs as "ERROR: unable to get tablet server status".  The
tservers appeared to lose their connection briefly, for less than 30 seconds, and then started
communicating again.  On the Tablet Server page of the monitor, the affected server would then
appear by IP in the list of Unresponsive servers and by hostname in the Tablet Servers list.

I can verify that applying this one-line fix to the 1.4.4 tag removes the server from the list
of unresponsive servers, and that balancing begins again once there are no unresponsive servers.

The "unable to get server status"  should still show up in the master logs.  Maybe it is actually

> master does not balance after intermittent communication failure
> ----------------------------------------------------------------
>                 Key: ACCUMULO-2112
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.5.0, 1.5.1
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>             Fix For: 1.4.5, 1.5.1, 1.6.0
> The master had a momentary connection timeout error collecting stats from a single tablet
> server.  Because the connection was re-established on the next attempt, the master did not
> remove it from the bad servers list.  Because the bad server list was not cleared, it did
> not re-balance.
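
The failure mode described above can be sketched roughly as follows.  This is only an
illustration of the bookkeeping involved, not the actual Accumulo master code or the actual
one-line fix: the class BadServerTracker and its methods recordPass and canBalance are
hypothetical names, and the rule "balance only when the bad server list is empty" is taken
from the issue description.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names are illustrative and not taken from the Accumulo source.
class BadServerTracker {
    private final Set<String> badServers = new HashSet<>();

    // Record the outcome of one status-collection pass.
    // statusOk maps a server name to whether its status fetch succeeded.
    void recordPass(Map<String, Boolean> statusOk) {
        for (Map.Entry<String, Boolean> e : statusOk.entrySet()) {
            if (e.getValue()) {
                // The step the issue reports as missing: a server that answers
                // again has to be taken off the bad list.
                badServers.remove(e.getKey());
            } else {
                badServers.add(e.getKey());
            }
        }
    }

    // Balancing is only attempted while no server is considered bad.
    boolean canBalance() {
        return badServers.isEmpty();
    }
}
```

Without the remove call, a single failed pass would leave the server on the bad list
indefinitely, and canBalance() would keep returning false even after the server recovered.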

This message was sent by Atlassian JIRA
