hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: unstable cluster
Date Tue, 12 Apr 2016 02:03:32 GMT
From region server log:

2016-04-11 03:11:51,589 WARN org.apache.zookeeper.ClientCnxnSocket:
Connected to an old server; r-o mode will be unavailable
2016-04-11 03:11:51,589 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x52ee1452fec5ac has expired,
closing socket connection

From zookeeper log:

2016-04-11 03:11:27,323 - INFO  [CommitProcessor:0:NIOServerCnxn@1435] -
Closed socket connection for client /172.20.67.19:58404 which had sessionid
0x52ee1452fec71f
2016-04-11 03:11:53,301 - INFO  [CommitProcessor:0:NIOServerCnxn@1435] -
Closed socket connection for client /172.20.67.13:32946 which had sessionid
0x52ee1452fec6ea

Note the 26-second gap between these two zookeeper log entries.
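
A minimal sketch for spotting such gaps in a ZooKeeper log, assuming every entry starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp as in the excerpt above. The function names and the 20-second threshold are hypothetical, chosen only for illustration; lines without a leading timestamp (wrapped text, stack traces) are skipped.

```python
from datetime import datetime

# Timestamp layout seen in the quoted log lines, e.g. "2016-04-11 03:11:27,323".
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # number of characters occupied by the timestamp


def parse_ts(line):
    """Parse the leading timestamp of a log line; raises ValueError if absent."""
    return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)


def find_gaps(lines, threshold_seconds):
    """Return (previous_ts, current_ts, gap_seconds) for each gap above the threshold."""
    gaps = []
    prev = None
    for line in lines:
        try:
            ts = parse_ts(line)
        except ValueError:
            continue  # continuation line without a timestamp
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_seconds:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps


# The two ZooKeeper entries quoted above, each joined onto one line.
sample = [
    "2016-04-11 03:11:27,323 - INFO  [CommitProcessor:0:NIOServerCnxn@1435] - "
    "Closed socket connection for client /172.20.67.19:58404 which had sessionid "
    "0x52ee1452fec71f",
    "2016-04-11 03:11:53,301 - INFO  [CommitProcessor:0:NIOServerCnxn@1435] - "
    "Closed socket connection for client /172.20.67.13:32946 which had sessionid "
    "0x52ee1452fec6ea",
]

for prev_ts, cur_ts, gap in find_gaps(sample, threshold_seconds=20):
    print(f"{gap:.3f}s with no log entries between {prev_ts} and {cur_ts}")
```

Running the same function over the full log of each ZooKeeper server (for example `find_gaps(open(path), 20)`, where `path` is wherever the log lives) should show whether all servers went quiet at the same time or only one of them did.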

What do you see in the logs of the other two zookeeper servers?

Thanks

On Mon, Apr 11, 2016 at 5:08 PM, Ted Tuttle <ted@mentacapital.com> wrote:

> Hello -
>
> We've started experiencing regular failures of our HBase cluster.  For the
> last week we've had nightly failures about 1hr after a heavy batch process
> starts.
>
> In the logs below we see the failure starting at 2016-04-11 03:11 in
> zookeeper, master and region server logs:
>
> zookeeper:  http://pastebin.com/kf7ja22K
>
> region server: http://pastebin.com/tduJgKqq
>
> master:  http://pastebin.com/0szhi0bJ
>
> The master log seems most interesting.  Here we see problems connecting to
> Zookeeper then a number of region servers dying in quick succession.  From
> the log evidence it appears Zookeeper is not responding rather than the
> more typical GC causing isolated RS to abort.
>
> Any insights on what may be happening here?
>
> Best,
> Ted
>
