zookeeper-dev mailing list archives

From "Michael Han (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-3240) Close socket on Learner shutdown to avoid dangling socket
Date Fri, 11 Jan 2019 08:35:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740161#comment-16740161

Michael Han commented on ZOOKEEPER-3240:

[~nixon] Good catch, the fix looks reasonable.

I've seen a similar issue in my production environment. The fix I made was on the Leader side:
I tracked the LearnerHandler threads associated with server ids and made sure each server
id only ever has a single LearnerHandler thread. This also works in cases where the learners don't
get a chance to close their sockets, or they did but for some reason the TCP reset never
made it to the leader. But in any case, it's good to fix the resource leak on the learner side.
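The leader-side mitigation described above can be sketched roughly as follows. This is a hypothetical illustration, not ZooKeeper code: the class and method names (HandlerRegistry, register) are made up, and the inner Handler stands in for a LearnerHandler, keeping only the close-on-replace behavior that matters here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: ensure at most one handler per server id on the leader.
public class HandlerRegistry {

    // Stand-in for a LearnerHandler; only close() matters for this sketch.
    static class Handler {
        final long sid;
        volatile boolean closed = false;
        Handler(long sid) { this.sid = sid; }
        void close() { closed = true; } // real code would close the socket and stop the thread
    }

    private final Map<Long, Handler> handlers = new ConcurrentHashMap<>();

    // Register a new handler for sid, closing any stale one left behind by a
    // learner that died without its TCP reset ever reaching the leader.
    Handler register(long sid) {
        Handler fresh = new Handler(sid);
        Handler stale = handlers.put(sid, fresh);
        if (stale != null) {
            stale.close();
        }
        return fresh;
    }
}
```

The point of keying by server id rather than by socket is that it catches duplicates even when the old connection looks perfectly healthy to the leader.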

I also wonder how we could get into this state on the Leader side in the first place. On the leader, we
do set a socket read timeout via setSoTimeout for learner handler threads (after the socket
is created via serverSocket.accept), and each learner handler constantly polls /
tries to read from the socket afterwards. If a learner dies but leaves a valid socket open,
I would expect the leader-side LearnerHandler thread reading from that dead
learner's socket to eventually time out, throw a SocketTimeoutException, and cause
the LearnerHandler thread on the leader to kill itself. That does not seem to be what
I observed, though. Do you have any insights on this?
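For reference, the read-timeout behavior being relied on above can be demonstrated in isolation with a minimal sketch (the port, timeout value, and class name are arbitrary): a socket with SO_TIMEOUT set throws SocketTimeoutException when the peer goes silent on a blocked read.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Minimal demonstration: a blocked read on a socket with SO_TIMEOUT set
// gives up with SocketTimeoutException once the peer stops sending.
public class SoTimeoutDemo {
    public static boolean readTimesOut() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.setSoTimeout(200); // analogous to the leader's setSoTimeout
            try {
                accepted.getInputStream().read(); // peer never writes anything
                return false;
            } catch (SocketTimeoutException expected) {
                return true; // the blocked read gave up after the timeout
            }
        }
    }
}
```

One caveat worth noting: SO_TIMEOUT only governs read operations, so a handler thread that is blocked writing (e.g. on a full send queue, as described in the issue below) would never see this timeout fire.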

> Close socket on Learner shutdown to avoid dangling socket
> ---------------------------------------------------------
>                 Key: ZOOKEEPER-3240
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3240
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.6.0
>            Reporter: Brian Nixon
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
> There was a Learner that had two connections to the Leader after that Learner hit an
unexpected exception while flushing a txn to disk, which shuts down the previous follower instance
and starts a new one.
> {quote}2018-10-26 02:31:35,568 ERROR [SyncThread:3:ZooKeeperCriticalThread@48] - Severe
unrecoverable error, from thread : SyncThread:3
> java.io.IOException: Input/output error
>         at java.base/sun.nio.ch.FileDispatcherImpl.force0(Native Method)
>         at java.base/sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:72)
>         at java.base/sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:395)
>         at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:457)
>         at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:548)
>         at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:769)
>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:246)
>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:172)
> 2018-10-26 02:31:35,568 INFO  [SyncThread:3:ZooKeeperServerListenerImpl@42] - Thread
SyncThread:3 exits, error code 1
> 2018-10-26 02:31:35,568 INFO [SyncThread:3:SyncRequestProcessor@234] - SyncRequestProcessor
> It is supposed to close the previous socket, but that doesn't seem to be done anywhere
in the code. This leaves the socket open with no one reading from it, which caused the send queue
to fill up and block the sender.
> Since the LearnerHandler didn't shut down gracefully, the learner queue size keeps growing,
the JVM heap size on the leader keeps growing and adds pressure on the GC, causing high GC
time and latency in the quorum.
> The simple fix is to gracefully shut down the socket.
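The shape of that fix can be sketched as follows. This is an illustrative sketch only, assuming a `sock` field holding the learner's connection to the leader; the class name and surrounding structure only loosely mirror the real Learner class.

```java
import java.io.IOException;
import java.net.Socket;

// Hypothetical sketch of the fix: make the learner's shutdown path close
// its connection to the leader so no dangling socket is left behind.
public class LearnerSketch {
    protected Socket sock; // connection to the leader

    public void shutdown() {
        // ... stop request processors, sync threads, etc. ...
        if (sock != null && !sock.isClosed()) {
            try {
                sock.close();
            } catch (IOException e) {
                // best effort: log and keep shutting down
            }
        }
    }
}
```

Closing the socket here means the leader gets a clean FIN/RST even when the follower is being restarted in-process, so the old LearnerHandler's read fails promptly instead of lingering.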

This message was sent by Atlassian JIRA
