Mailing-List: contact zookeeper-dev-help@hadoop.apache.org; run by ezmlm
Reply-To: zookeeper-dev@hadoop.apache.org
Message-ID: <289860414.357311268977407405.JavaMail.jira@brutus.apache.org>
Date: Fri, 19 Mar 2010 05:43:27 +0000 (UTC)
From: "Benjamin Reed (JIRA)"
To: zookeeper-dev@hadoop.apache.org
Subject: [jira] Updated: (ZOOKEEPER-710) permanent ZSESSIONMOVED error after client app reconnects to zookeeper cluster
In-Reply-To: <1808365257.339801268912432093.JavaMail.jira@brutus.apache.org>

[ https://issues.apache.org/jira/browse/ZOOKEEPER-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-710:
------------------------------------

    Hadoop Flags: [Reviewed]

+1 great work pat! and thanx Lukasz for identifying this failure condition.

> permanent ZSESSIONMOVED error after client app reconnects to zookeeper cluster
> ------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-710
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-710
> Project: Zookeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.2.2
> Environment: debian lenny; ia64; xen virtualization
> Reporter: Lukasz Osipiuk
> Assignee: Patrick Hunt
> Priority: Blocker
> Fix For: 3.2.3, 3.3.0
>
> Attachments: app1.log.2010-03-16.gz, app2.log.2010-03-16.gz, ZOOKEEPER-710_3.3.patch, zookeeper-node1.log.2010-03-16.gz, zookeeper-node2.log.2010-03-16.gz, zookeeper-node3.log.2010-03-16.gz
>
>
> The problem was originally described on the Users mailing list, starting with this [post|http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201003.mbox/<3b910d891003160743k38e2e7c9y830b182d88396d55@mail.gmail.com>].
> Below I restate it in a more organized form.
> We occasionally (a few times a day) observe that our client application disconnects from the Zookeeper cluster.
> The application is written in C++ and uses the libzookeeper_mt library, version 3.2.2.
> The disconnects are probably related to problems with our network infrastructure: we are observing periods of heavy packet loss between machines in our DC.
> Sometimes, after the client application (i.e. the zookeeper library) reconnects to the zookeeper cluster, all subsequent requests return the ZSESSIONMOVED error. Restarting the client app helps: we always pass 0 as clientid to the zookeeper_init function, so the old session is not reused.
> On 16-03-2010 we observed a few occurrences of the problem.
> Example ones:
> - 22:08; client IP 10.1.112.60 (app1); sessionID 0x22767e1c9630000
> - 14:21; client IP 10.1.112.61 (app2); sessionID 0x324dcc1ba580085
> I attach logs of the cluster and application nodes (only the parts concerning zookeeper):
> - [^zookeeper-node1.log.2010-03-16.gz] - logs of zookeeper cluster node 1 10.1.112.62
> - [^zookeeper-node2.log.2010-03-16.gz] - logs of zookeeper cluster node 2 10.1.112.63
> - [^zookeeper-node3.log.2010-03-16.gz] - logs of zookeeper cluster node 3 10.1.112.64
> - [^app1.log.2010-03-16.gz] - application logs of app1 10.1.112.60
> - [^app2.log.2010-03-16.gz] - application logs of app2 10.1.112.61
> I also did some analysis of the case at 22:08:
> - The network glitch that resulted in the problem occurred at about 22:08.
> - From what I see, node2 had been the leader since 17:48, and that did not change for the rest of yesterday.
> - The client had been connected to node2 since 17:50.
> - At around 22:09 the client tried to connect to every node (1, 2, 3). Connections to node1 and node3 were closed with the exception "Exception causing close of session 0x22767e1c9630000 due to java.io.IOException: Read error". The connection to node2 stayed alive.
> - All subsequent operations were refused with the ZSESSIONMOVED error. The error is visible on both the client and the server side.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
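
The log analysis in the report comes down to pulling out every line that mentions one session ID across the node and application logs. Below is a minimal sketch of that filtering step in C++. It assumes log lines carry the ID in the form "session 0x..." as in the exception message quoted in the report; real log lines may use other layouts. The names extract_session_id and lines_for_session are hypothetical helpers for illustration and are not part of ZooKeeper or its client library.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Returns the hexadecimal token that follows the word "session " in a
// log line (for example "0x22767e1c9630000"), or an empty string if the
// line has no such token. The "session 0x..." layout is an assumption
// based on the exception message quoted in the report.
std::string extract_session_id(const std::string& line) {
    const std::string marker = "session 0x";
    const std::size_t pos = line.find(marker);
    if (pos == std::string::npos) {
        return "";
    }
    const std::size_t start = pos + marker.size() - 2;  // index of "0x"
    std::size_t end = start + 2;                        // first digit after "0x"
    while (end < line.size() &&
           std::isxdigit(static_cast<unsigned char>(line[end]))) {
        ++end;
    }
    if (end == start + 2) {
        return "";  // "0x" with no hex digits after it
    }
    return line.substr(start, end - start);
}

// Keeps only the lines whose extracted session ID equals session_id.
std::vector<std::string> lines_for_session(
        const std::vector<std::string>& lines,
        const std::string& session_id) {
    std::vector<std::string> matches;
    for (const std::string& line : lines) {
        if (extract_session_id(line) == session_id) {
            matches.push_back(line);
        }
    }
    return matches;
}
```

Calling lines_for_session on the lines of each attached log with "0x22767e1c9630000" would leave only the entries for the 22:08 case, which can then be ordered by timestamp across the three nodes.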