From: Patrick Hunt
Date: Tue, 16 Mar 2010 16:51:33 -0700
To: zookeeper-user@hadoop.apache.org
Subject: Re: permanent ZSESSIONMOVED

It will be good to see the logs, however I had one additional thought.

The leader (the zk leader) is the one checking for session MOVED. It
keeps track of which server the session is currently attached to and
will throw the moved exception if the session proposes a request through
a server other than the one the leader thinks is the owner.

I'm wondering, if/when you see this again, what would happen if you
restart the server that the session is attached to (use netstat on the
client to find it). The client will re-attach to the cluster, and I'm
wondering if this would fix the problem (rather than restarting the
client as you have been doing).

Not sure if you can try this (production env?) but it would be an
interesting additional data point if you can give it a try.

Regards,

Patrick

Patrick Hunt wrote:
> Yes, if you search "back" (older entries) in the server log you will be
> able to see who the leader is, it will say something like "LEADING" or
> "FOLLOWING", but this may change over time (which is why you need to
> search "back" as I mention) if leadership within the ZK cluster changes
> (say due to networking issue). This is why I mention the logs so highly
> - it really will give us much additional insight into the issue.
>
> here's an example of a 5 server ensemble:
> phunt@valhalla:~/dev/workspace/zkconf/test5[master]$ egrep LEAD local*/*.log
> localhost:2184/zoo.log:2010-03-16 12:50:13,711 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2184:QuorumPeer@632] - LEADING
> phunt@valhalla:~/dev/workspace/zkconf/test5[master]$ egrep FOLLOW local*/*.log
> localhost:2181/zoo.log:2010-03-16 12:50:13,649 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - FOLLOWING
> localhost:2182/zoo.log:2010-03-16 12:50:13,933 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2182:QuorumPeer@620] - FOLLOWING
> localhost:2183/zoo.log:2010-03-16 12:50:13,901 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2183:QuorumPeer@620] - FOLLOWING
> localhost:2185/zoo.log:2010-03-16 12:50:13,661 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:QuorumPeer@620] - FOLLOWING
>
> Additionally if you use the "stat" 4letter word you will see the current
> status of the server, leader or follower. (JMX as well)
>
> You might also find this useful: http://github.com/phunt/zktop
>
> Patrick
>
> Łukasz Osipiuk wrote:
>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt wrote:
>>> We'll probably need the ZK server/client logs to hunt this down. Can you
>>> tell if the MOVED happens in some particular scenario, say you are
>>> connected to a follower and move to a leader, or perhaps you are
>>> connected to server A, get disconnected and reconnected to server A?
>>> .... is there some pattern that could help us understand what's
>>> causing this?
>>
>> When I get to office tomorrow I will try to investigate logs and maybe
>> i will be able to find out what the error scenario is.
>> But I am not sure if I will be able to find out what was the role of
>> each node when problem occurred?
>> Does zookeeper server log when node state changes between follower and
>> leader. Or can I make it log it?
>>
>> Regards, Łukasz
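
The log search quoted above can also be scripted. Here is a minimal sketch, assuming each server's log lines look like the egrep output in the quoted example. The regular expression is inferred from those sample lines only, not from any documented log format, and `latest_roles` and `SAMPLE` are hypothetical names introduced just for illustration. Because leadership may change over time, the sketch keeps the most recent LEADING/FOLLOWING entry per log file.

```python
import re

# Pattern inferred from the sample egrep output in the quoted message:
#   <log path>:<timestamp> - INFO [...] - LEADING|FOLLOWING
# This is an assumption about the line shape, not a documented format.
ROLE_LINE = re.compile(
    r"^(?P<path>.+?\.log):"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r" .* - (?P<role>LEADING|FOLLOWING)\s*$"
)


def latest_roles(lines):
    """Return {log path: (timestamp, role)} using the newest entry per log.

    Timestamps are compared as strings, which is safe only because the
    assumed format is fixed-width and ordered from year down to millisecond.
    """
    latest = {}
    for line in lines:
        match = ROLE_LINE.match(line.strip())
        if match is None:
            continue  # not a role-change line; skip it
        path, ts, role = match.group("path", "ts", "role")
        if path not in latest or ts >= latest[path][0]:
            latest[path] = (ts, role)
    return latest


# Two of the lines from the quoted 5-server example.
SAMPLE = [
    "localhost:2184/zoo.log:2010-03-16 12:50:13,711 - INFO "
    "[QuorumPeer:/0:0:0:0:0:0:0:0:2184:QuorumPeer@632] - LEADING",
    "localhost:2181/zoo.log:2010-03-16 12:50:13,649 - INFO "
    "[QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - FOLLOWING",
]

if __name__ == "__main__":
    for path, (ts, role) in sorted(latest_roles(SAMPLE).items()):
        print(path, ts, role)
```

To work out each node's role at the moment the problem occurred, rather than its latest role, the same loop could be changed to keep only entries with a timestamp at or before the time of the first ZSESSIONMOVED error.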