accumulo-dev mailing list archives

From "Keith Turner (Updated) (JIRA)" <>
Subject [jira] [Updated] (ACCUMULO-327) master lost all tablet servers
Date Fri, 27 Jan 2012 21:56:09 GMT


Keith Turner updated ACCUMULO-327:

    Fix Version/s: 1.4.0

This may not be an issue in 1.3 because there is no merge operation where the master asks a
tablet server to split.  I am not sure whether there are other tserver operations where the
synchronization of the connection could cause deadlock.

> master lost all tablet servers
> ------------------------------
>                 Key: ACCUMULO-327
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>         Environment: running the random walk test on a small cluster
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>             Fix For: 1.4.0
> The master would occasionally take a long time to collect status information from a tablet
> server.  The connection would time out after the default 120-second RPC timeout.  This probably
> left the connection in a bad state, because I am seeing
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got
>         at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_halt(
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.halt(
> {noformat}
> If the master is unable to collect statistics on the tablet server, it attempts to halt
> it (as above) and then removes the tablet server's lock in zookeeper.
> Eventually, under the pressure of random walk operations, the master killed every tablet
> server.
> Guess: a lock in the tablet server is delaying status reporting.
> I wrote a script to process the master logs.  It saves each line that refers to the IP
> address of a tablet server.  When it sees the zookeeper lock has been deleted, it prints the
> last N lines that refer to that tablet server.
> In 7 out of the 10 cases, a split timed out prior to or during the status request failures.
> In 5 cases, the tablet server was hosting the root tablet (a necessary condition when
> the last server died).
> In 5 cases, the table_table info tablet was being hosted.
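
The log-processing script mentioned in the issue is not included in this message. As a rough
sketch of the approach it describes, the Python below keeps the last N lines seen for each IP
address and emits them when a line for that address reports a deleted lock. The actual master
log format is not shown in the issue, so the IPv4 pattern, the "lock deleted" marker string,
and the function name are all assumptions made for illustration.

```python
import re
from collections import defaultdict, deque

# Assumption: tablet servers appear in the master log as dotted IPv4 addresses.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical marker text; the real wording in the master log is not given in the issue.
LOCK_DELETED_MARKER = "lock deleted"


def last_lines_before_lock_loss(lines, n=20, marker=LOCK_DELETED_MARKER):
    """Yield (ip, recent_lines) whenever a line mentioning ip contains the marker.

    recent_lines holds at most the last n lines that mentioned that ip,
    including the line that reported the deleted lock.
    """
    # One bounded buffer per IP address; old lines fall off automatically.
    history = defaultdict(lambda: deque(maxlen=n))
    for raw in lines:
        line = raw.rstrip("\n")
        ips = sorted(set(IP_RE.findall(line)))
        for ip in ips:
            history[ip].append(line)
        if marker in line.lower():
            for ip in ips:
                yield ip, list(history[ip])
```

To run it over a real log, pass an open file object as `lines` and print each yielded group;
the marker would need to be adjusted to whatever the master actually logs when it deletes a
tablet server's lock.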
