hadoop-zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: permanent ZSESSIONMOVED
Date Wed, 17 Mar 2010 18:48:21 GMT
Thanks for this Michael, a bit of clarification/advice inline below:

Michael Bauland wrote:
> Hi,
> 
>> This is very different from most uses of ZK.  Normally Zookeeper is used on
>> a private network with almost no packet loss.  Your high rate of loss may be
>> tickling a protocol bug that other people
>> just never see.
> 
> I'm not sure, but maybe my input could help, too. As I mentioned
> earlier, I also run the three zookeeper servers not in a local
> environment but across two sites in different countries (soon, in
> production, it'll be three sites).
> I'm accessing the zookeeper ensemble continuously with about 5 clients
> for testing purposes. These clients are running on the same machine as
> one of the three zookeeper servers. I have attached logs of two of the
> zookeeper servers (one leader and one follower).
> My clients have a way to recover from connection loss by trying ten
> times, then waiting for 5 seconds, closing the zookeeper connection,
> opening a new one and trying up to ten times again; then again waiting
> 10 secs, closing, reconnecting, retrying, etc., up to 50 seconds, at
> which point they fail.

The ZK client lib will handle connection loss automatically; I'm not 
sure what you mean by "trying ten times" to recover from connection loss.

Basically, the client lib will notify your watcher when the session is 
disconnected from the cluster. If you have "in process" operations they 
will get the "connection loss" exception. Once this happens you just 
want to wait: the ZK client lib will attempt to connect to the cluster 
again (one of the servers in the connect string), and once it does it 
will notify you again via the watcher (SyncConnected event).

If the session reconnects to the cluster within the session timeout, 
all watches are restored and nothing is required on your part. If you 
exceed the timeout, the cluster will have expired your session, and you 
will be notified of this as soon as the client reconnects to the cluster.

See item 3 in the FAQ; there's some detail there:
http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A3

So really, in your code you want to:

1) if you get disconnected, just wait until you get reconnected; the 
client lib handles all that for you.
2) if you get expired, you must create a new session.
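
For what it's worth, here's a minimal sketch of that pattern in Java 
(the connect string, timeout and class name below are placeholders, not 
taken from your setup). The point is that Disconnected and SyncConnected 
are informational; only Expired requires action from your code:

import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionSketch implements Watcher {
    // Placeholder connect string and session timeout.
    private static final String CONNECT =
        "serverA:2181,serverB:2181,serverC:2181";
    private static final int TIMEOUT_MS = 30000;

    private volatile ZooKeeper zk;

    public void start() throws IOException {
        zk = new ZooKeeper(CONNECT, TIMEOUT_MS, this);
    }

    public void process(WatchedEvent event) {
        switch (event.getState()) {
        case Disconnected:
            // 1) just wait; the client lib is already retrying the
            // other servers in the connect string
            break;
        case SyncConnected:
            // reconnected within the session timeout: watches are
            // restored, nothing to do
            break;
        case Expired:
            // 2) the session is gone; only now create a new one
            try {
                start();
            } catch (IOException e) {
                // log and retry as appropriate
            }
            break;
        default:
            break;
        }
    }
}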

I seem to remember you may have had some issues with local vs. remote 
though... so it may not be as clear-cut. But in general the client lib 
handles all of this for you.

> Sometimes all of the clients fail at about the same time. When I look at
> the zookeeper logs I see that they report that the client connection
> limit has been reached (which is set to 10, since I didn't set the value
> at all). This shouldn't happen, since I have just 5 clients running and
> none of them opens two connections at the same time - they're just
> single threaded.
> 

These clients are all on the same IP, right?

Are you sure you are closing the old sessions in all cases? It could 
also be the case that you

1) create a session, it gets disconnected, so you close it, and
2) create a new session (and so on)

The session created in 1) (even if you closed it) may still be 
considered alive by the server until a) it expires, or b) the session 
close from 1) eventually makes its way back to the server.
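
On the connection limit itself: the per-IP cap is the maxClientCnxns 
setting in each server's zoo.cfg. Raising it only hides the symptom if 
leaked sessions are the real problem, but for reference it would look 
something like this (the other values are illustrative only, not taken 
from your config):

# zoo.cfg (illustrative)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
# max concurrent connections from a single client IP; 0 removes the limit
maxClientCnxns=20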

> To me it seems that there are some connections which actually shouldn't
> be there (anymore).

The easiest way to see this is with the "dump" command on the leader. 
It will tell you which sessions the cluster thinks are still active. 
Run it while your test is running and see what it reports...
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
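
On a box with netcat that's just "echo dump | nc <leader host> 2181". 
If you'd rather drive it from Java, a throwaway sketch (host/port are 
placeholders for wherever your leader is listening):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class DumpCommand {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port -- point these at the current leader.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = 2181;
        Socket sock = new Socket(host, port);
        try {
            OutputStream out = sock.getOutputStream();
            out.write("dump".getBytes("US-ASCII"));
            out.flush();
            InputStream in = sock.getInputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
            System.out.flush();
        } finally {
            sock.close();
        }
    }
}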

> For the logs, note that the clients are also running on the server which
> is the follower at that time, and I replaced my IPs with letters.

Take a look at this session 0x226f49bb5d78ea8 for example:

2010-03-08 15:27:20,525 - INFO  [NIOServerCxn.Factory:2181:NIOServerCnxn@639] - Creating new session 0x226f49bb5d78ea8
2010-03-08 15:27:22,253 - INFO  [CommitProcessor:0:NIOServerCnxn@992] - Finished init of 0x226f49bb5d78ea8 valid:true
2010-03-08 15:27:22,254 - WARN  [NIOServerCxn.Factory:2181:NIOServerCnxn@518] - Exception causing close of session 0x226f49bb5d78ea8 due to java.io.IOException: Read error
2010-03-08 15:27:22,254 - INFO  [NIOServerCxn.Factory:2181:NIOServerCnxn@857] - closing session:0x226f49bb5d78ea8 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/A.B.C.E:2181 remote=/A.B.C.D:49258]
2010-03-08 15:27:28,000 - INFO  [SessionTracker:SessionTrackerImpl@133] - Expiring session 0x226f49bb5d78ea8
2010-03-08 15:27:28,001 - INFO  [SessionTracker:ZooKeeperServer@326] - Expiring session 0x226f49bb5d78ea8
2010-03-08 15:27:28,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@384] - Processed session termination request for id: 0x226f49bb5d78ea8

It looks like the client is using a 5 second session timeout. I can see 
it connecting to the server, then the connection failing (or the client 
closing? I don't have the log for that) after just a fraction of a 
second. The server then expires the session 5 seconds later, which 
means the server never saw a close from the client...

Here's an example where the client called close:
2010-03-08 15:27:19,053 - INFO  [ProcessThread:-1:PrepRequestProcessor@384] - Processed session termination request for id: 0x226f49bb5d78ea6
2010-03-08 15:27:20,099 - INFO  [CommitProcessor:0:NIOServerCnxn@857] - closing session:0x226f49bb5d78ea6 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/A.B.C.E:2181 remote=/A.B.C.D:49249]

Check out time index 15:27:20,605 in the leader logs. It looks to me 
like network connectivity problems... notice that a bit later the 
clients renew their sessions successfully.


As I mentioned, try using the dump command:
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands

and see if that sheds any light on the maxClientCnxns issue.

You could also look at both the client and server logs and correlate 
the sessions in your client against information on the server (i.e. 
which sessions are active at which times).

Hope this helps, regards,

Patrick
