hadoop-zookeeper-dev mailing list archives

From "Patrick Hunt (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server
Date Wed, 24 Feb 2010 20:37:28 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837994#action_12837994 ]

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

I agree, this is a potentially serious issue. Unfortunately, based on the information
we have, I don't see how I can provide more insight. Also bear in mind that we have many
users in similar situations, yet this is the first time we've heard of this type of issue
(not that that diminishes your issue). So I just don't have much to go on.

I would suggest that you check your monitoring script and ensure it handles all error cases,
such as failing to connect to the server, or getting a partial response due to things like
the linger issue.
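
As a rough illustration of the error handling meant here, the sketch below probes a server with the "ruok" four-letter command and reports connect failures, timeouts, and partial replies as distinct outcomes. The function names (probe, classify_reply), the status strings, and the 5-second timeout are all assumptions made for this example; they are not taken from the reporter's actual monitoring script.

```python
import socket


def classify_reply(data: bytes) -> str:
    """Classify the bytes received in answer to a 'ruok' probe."""
    if data == b"imok":
        return "ok"
    if data == b"":
        return "empty"        # connection closed before any reply arrived
    if b"imok".startswith(data):
        return "partial"      # reply was cut short
    return "unexpected"


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Send 'ruok' and read until the server closes the connection.

    The 'with' block closes the client socket on every path, including
    errors, so the probe itself never leaves a socket open.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            while True:
                chunk = sock.recv(1024)
                if not chunk:
                    break
                chunks.append(chunk)
    except socket.timeout:
        return "timeout"
    except OSError:
        return "connect_or_io_error"
    return classify_reply(b"".join(chunks))
```

A monitor built this way would call something like probe("10.65.20.68", 8181) and log every status other than "ok" together with a timestamp, so that failures can later be lined up against the server logs.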

Also make sure that you can capture the server and client logs if this happens again. If it
does, capture the full, detailed netstat output (netstat -a, I guess) so that we have detailed
information to work with.
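
To make a saved netstat capture easier to read, a small helper along these lines could tally socket states for the client port. This is only a sketch: it assumes numeric output in the usual Linux column order (for example from netstat -an), and the name count_states is made up for this example.

```python
from collections import Counter


def count_states(netstat_text: str, port: int) -> Counter:
    """Count TCP socket states for sockets whose local port is `port`."""
    counts = Counter()
    suffix = ":%d" % port
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed TCP row layout:
        # proto, recv-q, send-q, local address, foreign address, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[3].endswith(suffix):
            counts[fields[5]] += 1
    return counts
```

Feeding it the saved output and the clientPort (8181 in the reported configuration) gives a per-state count, which makes a growing number of CLOSE_WAIT sockets easy to spot over time.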

You might also make sure to save the transaction logs if this happens again. Not the log4j
logs, but the transaction logs that are kept in the dataDir. Those can actually be scanned
so that we can see what was going on (changes to znodes as well as session info).
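
One way to preserve those files before restarting a server is to copy them aside, as in the sketch below. It assumes the transaction logs live in a version-2 subdirectory of the dataDir and are named log.<something>; both the layout and the function name save_txn_logs are assumptions for illustration, so check them against your own installation.

```python
import shutil
from pathlib import Path


def save_txn_logs(data_dir: str, dest_dir: str) -> list:
    """Copy transaction log files out of the data directory.

    Assumes the layout <data_dir>/version-2/log.*; snapshots and any
    other files in that directory are left alone.
    """
    src = Path(data_dir) / "version-2"
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for path in sorted(src.glob("log.*")):
        if path.is_file():
            shutil.copy2(path, dest / path.name)  # copy2 keeps timestamps
            saved.append(path.name)
    return saved
```

Copying rather than moving keeps the server's own data directory untouched, which matters if the server is restarted right afterwards.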

Can you think of anything else that would help here? Have you been able to reproduce the problem,
or have you tried and failed? That's all I can think of for now.

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a ZooKeeper cluster with 5 servers, running ZooKeeper 3.2.1. Here is the content
> of the configuration file, zoo.cfg:
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, server.200 was the leader. Yesterday morning, I found
> that there were many sockets in the CLOSE_WAIT state on the clientPort (8181); the total
> was over 120. Because of these CLOSE_WAIT sockets, server.200 could not accept more connections
> from clients. The only thing I could do in this situation was restart server.200,
> at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

