hadoop-zookeeper-dev mailing list archives

From "Patrick Hunt (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server
Date Wed, 03 Feb 2010 18:03:27 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829162#action_12829162 ]

Patrick Hunt commented on ZOOKEEPER-662:

Qian, if you look at the logs you can see both of these clients: the client I mentioned in
an earlier comment, and also the "stat" client:

2010-02-01 06:24:49,783 - INFO  [NIOServerCxn.Factory:8181:NIOServerCnxn@698] - Processing
stat command from /
2010-02-01 06:24:49,783 - WARN  [NIOServerCxn.Factory:8181:NIOServerCnxn@494] - Exception
causing close of session 0x0 due to java.io.IOException: Responded to info probe

(Really, the second line should not be a WARN; this is improved in the 3.3.0 codebase.)

From the logs I don't see anything to indicate a problem, though. I'm wondering if there
is some timing problem in either our C or Java networking code (also, you are using Linux 2.6.9,
which is an older kernel; I'm wondering if perhaps the timing our app sees is different).

One thing about the four-letter words (like "stat"): in some cases I've seen the response to
a four-letter word be truncated. Perhaps this caused your monitoring app to fail? You might
add some diagnostics to your monitoring app to debug this sort of thing.
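One way to add such diagnostics, sketched below (this is illustrative code, not part of ZooKeeper; `probe_stat` and `toy_stat_server` are hypothetical names): send the four-letter command, read until the server closes the connection, and surface any partial response when the read fails so truncation is visible in the monitor's logs.

```python
import socket
import threading

def probe_stat(host, port, timeout=5.0):
    """Send ZooKeeper's 'stat' four-letter command and read the reply
    until the server closes the connection (EOF).  On failure, report
    any partial response so truncation shows up in the monitor's logs."""
    chunks = []
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"stat")
            while True:
                data = s.recv(4096)
                if not data:          # EOF: server closed after responding
                    break
                chunks.append(data)
    except OSError as e:
        partial = b"".join(chunks).decode("utf-8", errors="replace")
        raise RuntimeError(
            f"stat probe failed after {len(partial)} bytes: {e}; "
            f"partial response: {partial!r}") from e
    return b"".join(chunks).decode("utf-8", errors="replace")

def toy_stat_server(payload: bytes) -> int:
    """Stand-in for the server side, for local testing only: read the
    command, write the whole response, then close the socket."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]
    def run():
        conn, _ = srv.accept()
        conn.recv(16)            # consume the 4-letter command
        conn.sendall(payload)    # push the whole response onto the wire
        conn.close()             # then close, as the real server does
        srv.close()
    threading.Thread(target=run, daemon=True).start()
    return port
```

Reading until EOF (rather than doing a single recv) matters here: a "stat" reply can span several TCP segments, so a monitor that reads only once will see a "truncated" response even when the server sent everything.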

What I mean is: you request a "stat" and the client sees some of the response, but not all
of it. I'm not sure why this happens, but it may have something to do with either the way nc
works (I always use nc for this) or the way the server works, in the sense that the server
pushes the response text onto the wire and then closes the connection. Perhaps in some cases
the socket close causes the client to miss part of the response? Is that possible in a TCP close?
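On the TCP question: an orderly close (FIN) cannot lose data for a client that reads until EOF, but an abortive close (RST, for example when a socket is closed with SO_LINGER set to zero) can discard data the client has not yet read. A small self-contained sketch of the difference (the toy server and reader are illustrative, not ZooKeeper code; whether the real server or nc ever triggers an RST is an open question here):

```python
import socket
import struct
import threading

def serve_and_close(payload: bytes, abortive: bool = False) -> int:
    """Toy server: write payload, then close.  With abortive=True,
    SO_LINGER(on, 0) makes close() send RST instead of FIN, which can
    discard data still in flight to the peer."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]
    def run():
        conn, _ = srv.accept()
        if abortive:
            conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                            struct.pack("ii", 1, 0))
        conn.sendall(payload)
        conn.close()
        srv.close()
    threading.Thread(target=run, daemon=True).start()
    return port

def read_until_close(port: int) -> bytes:
    """Read everything the server sends, stopping at EOF or reset."""
    chunks = []
    with socket.create_connection(("127.0.0.1", port)) as s:
        try:
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except ConnectionError:
            pass   # RST: anything not yet read may be gone
    return b"".join(chunks)
```

With the orderly close, the reader always gets the whole payload; with the abortive variant, the result can come up short, which would match the "client did not see all of the response" symptom.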

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1. Here is the content
> of the configuration file, zoo.cfg:
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=
> server.101=
> server.102=
> server.200=
> server.201=
> =====
> Before the problem happened, server.200 was the leader. Yesterday morning, I found that
> there were many sockets in the CLOSE_WAIT state on the clientPort (8181); the total was
> about 120. Because of these CLOSE_WAITs, server.200 could not accept more connections from
> clients. The only thing I could do in this situation was restart server.200, at about
> 2010-02-01 06:06:35. The related log is attached to the issue.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
