hadoop-zookeeper-dev mailing list archives

From "Qian Ye (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server
Date Fri, 05 Feb 2010 02:22:27 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829920#action_12829920
] 

Qian Ye commented on ZOOKEEPER-662:
-----------------------------------

Hi Patrick, the C clients all run on Linux with kernel 2.6.9. Some of the
servers are 32-bit machines and some are 64-bit. It seems that the client on the
server 10.81.14.81 has some problem that causes it to fail frequently. Because there
is a monitor app that restarts the C client whenever it fails, the client on 10.81.14.81
keeps restarting and reconnecting to the ZooKeeper servers.

You mentioned that some of the responses to the "stat" request didn't reach the client. This looks
like the behavior of a TCP connection with the SO_LINGER option enabled: the
server only puts the response on the wire and closes, so the response packet may be discarded,
and the TCP/IP stack will not re-send it. Is that the scenario we are seeing here?

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a ZooKeeper cluster with 5 servers, ZooKeeper version 3.2.1. Here is the content
> of the configuration file, zoo.cfg:
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, server.200 was the leader. Yesterday morning, I found
> that there were many sockets in the CLOSE_WAIT state on the clientPort (8181); the total
> was around 120. Because of these CLOSE_WAIT sockets, server.200 could not accept more
> connections from the clients. The only thing I could do in this situation was restart
> server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.
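To count sockets stuck in CLOSE_WAIT on the client port, a filter over netstat output can be used (a sketch assuming the net-tools netstat; 8181 is the clientPort from zoo.cfg above; the sample lines here are fabricated to demonstrate the filter):

```shell
# Real usage on the server would be:
#   netstat -ant | awk '$4 ~ /:8181$/ && $6 == "CLOSE_WAIT"' | wc -l
# Below, two fabricated netstat-style lines demonstrate the filter:
printf 'tcp 0 0 10.65.20.68:8181 10.81.14.81:45210 CLOSE_WAIT\ntcp 0 0 10.65.20.68:8181 10.81.14.81:45211 ESTABLISHED\n' \
  | awk '$4 ~ /:8181$/ && $6 == "CLOSE_WAIT"' | wc -l
```

Only the CLOSE_WAIT line matches, so the pipeline prints 1; run against live netstat output it would report the count the issue describes.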

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

