zookeeper-dev mailing list archives

From "Fangmin Lv (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ZOOKEEPER-3296) Cannot join quorum due to Quorum SSLSocket connection not closed explicitly when there is handshake issue
Date Thu, 07 Mar 2019 18:10:00 GMT

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fangmin Lv updated ZOOKEEPER-3296:
----------------------------------
    Description: 
Recently, on prod ensembles, we saw some peers fail to connect to others because of timeouts
when connecting to the other peer's leader election port. This was triggered by a network
incident with heavy packet loss.

After investigation, we found the cause: we don't close the socket explicitly when it
times out during the SSL handshake in QuorumCnxManager.connectOne.
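
As a rough illustration of the idea (not the actual ZooKeeper patch), the sketch below closes the
socket explicitly whenever the handshake step throws. The class name, the Handshake interface, and
the method names are hypothetical stand-ins, and a plain Socket is used in place of an SSLSocket
so the example stays self-contained.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class HandshakeCloseSketch {
    /** Hypothetical stand-in for the SSL handshake step; not a ZooKeeper API. */
    interface Handshake {
        void run(Socket sock) throws IOException;
    }

    /**
     * Connects, runs the handshake, and closes the socket explicitly if either
     * step throws, so the remote side sees the close instead of a lingering
     * connection. Returns the connected socket on success, or null on failure.
     */
    static Socket connectOrClose(InetSocketAddress addr, int timeoutMs, Handshake hs) {
        Socket sock = new Socket();
        try {
            sock.connect(addr, timeoutMs);
            sock.setSoTimeout(timeoutMs);   // bounds blocking reads during the handshake
            hs.run(sock);
            return sock;
        } catch (IOException e) {           // includes SocketTimeoutException
            closeQuietly(sock);
            return null;
        }
    }

    /** Closes the socket and ignores any error raised while closing. */
    static void closeQuietly(Socket sock) {
        try {
            sock.close();
        } catch (IOException ignored) {
            // nothing useful to do if close itself fails
        }
    }
}
```

A caller would treat a null return as a failed connection attempt and retry later; the point is
only that no code path leaves the socket open after a handshake failure.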

The quorum connection manager handles connections sequentially, with a default listen backlog
queue size of 50. During the network loss, socket reads time out after syncLimit * tickTime,
and almost all of the subsequent connect requests in the backlog queue time out on the other
side before they are processed. Those timed-out learners then try to connect to a different
server, leaving their connect requests on the server side without sending a close_notify
packet. The server slowly consumes from this queue, waiting out the syncLimit * tickTime
timeout for each request that hasn't sent a close_notify packet. Any new connect request is
queued as soon as there is a free spot in the listen backlog queue, but it times out before
the server handles it. The server can therefore never finish a new connection, so it fails
to join the quorum. The peers also leak FDs, because all those connections stay in the
CLOSE-WAIT state.
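
To get a feel for how long a full backlog can stall a server, the timeout arithmetic can be
sketched as below. The values tickTime = 2000 ms and syncLimit = 5 used in the note after the
code are assumed example settings, not taken from the affected ensembles; the backlog size of 50
is the default mentioned in this description. The class and method names are made up for
illustration.

```java
final class BacklogDrainEstimate {
    /** Read timeout per stalled connection, per the description: syncLimit * tickTime. */
    static long readTimeoutMs(int syncLimit, int tickTimeMs) {
        return (long) syncLimit * tickTimeMs;
    }

    /**
     * Worst-case time to work through a backlog in which every queued
     * connection stalls, given that connections are handled one at a time.
     */
    static long worstCaseDrainMs(int stalledConnections, int syncLimit, int tickTimeMs) {
        return stalledConnections * readTimeoutMs(syncLimit, tickTimeMs);
    }
}
```

With the assumed values, readTimeoutMs(5, 2000) is 10000 ms per stalled connection, and
worstCaseDrainMs(50, 5, 2000) is 500000 ms, so a full backlog of dead requests could take up to
about 500 seconds to drain. Any client whose own connect timeout is shorter than that gives up
before it is served.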
  
 Restarting the servers to drain the listen backlog queue mitigated the issue.

Here are the steps to manually reproduce the issue:
 # open two telnet connections to server A in the quorum without sending any packets
 # stop all other servers
 # start them again
 # server A hits a read timeout on each of the telnet connections, one by one, and can no
longer join the quorum
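
The same effect can be shown in miniature without a ZooKeeper ensemble. The sketch below is not
ZooKeeper code: it uses plain TCP on the loopback interface instead of TLS, a hypothetical
sequential accept loop, and a caller-chosen read timeout standing in for syncLimit * tickTime.
Two idle connections play the role of the telnet sessions, and a third connection that actually
sends data has to wait for the idle ones to be handled.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class SequentialAcceptSketch {
    /**
     * Accepts `count` connections one at a time and blocks on a single read
     * from each, giving up after readTimeoutMs. Returns how many reads timed out.
     */
    static int serveSequentially(ServerSocket server, int count, int readTimeoutMs)
            throws IOException {
        int timedOut = 0;
        for (int i = 0; i < count; i++) {
            try (Socket sock = server.accept()) {
                sock.setSoTimeout(readTimeoutMs);
                try {
                    sock.getInputStream().read();
                } catch (SocketTimeoutException e) {
                    timedOut++;   // idle peer: the full timeout was spent on it
                }
            }
        }
        return timedOut;
    }

    /**
     * Opens two idle connections and one that sends a byte, then returns the
     * milliseconds the sequential server needed to handle all three.
     */
    static long measureDelayMs(int readTimeoutMs) throws IOException {
        InetAddress lo = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 50, lo);
             Socket idle1 = new Socket(lo, server.getLocalPort());
             Socket idle2 = new Socket(lo, server.getLocalPort());
             Socket real = new Socket(lo, server.getLocalPort())) {
            real.getOutputStream().write(1);
            real.getOutputStream().flush();
            long start = System.nanoTime();
            serveSequentially(server, 3, readTimeoutMs);
            return (System.nanoTime() - start) / 1_000_000;
        }
    }
}
```

Calling measureDelayMs with a short timeout should return roughly twice that timeout, because
each idle connection costs one full read timeout before the server moves on.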

> Cannot join quorum due to Quorum SSLSocket connection not closed explicitly when there is handshake issue
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3296
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3296
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.4, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Fangmin Lv
>            Priority: Major
>             Fix For: 3.5.4, 3.6.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
