Return-Path: Delivered-To: apmail-hadoop-zookeeper-user-archive@minotaur.apache.org Received: (qmail 83892 invoked from network); 23 Feb 2010 21:21:11 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 23 Feb 2010 21:21:11 -0000 Received: (qmail 68372 invoked by uid 500); 23 Feb 2010 21:21:11 -0000 Delivered-To: apmail-hadoop-zookeeper-user-archive@hadoop.apache.org Received: (qmail 68332 invoked by uid 500); 23 Feb 2010 21:21:11 -0000 Mailing-List: contact zookeeper-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: zookeeper-user@hadoop.apache.org Delivered-To: mailing list zookeeper-user@hadoop.apache.org Received: (qmail 68317 invoked by uid 99); 23 Feb 2010 21:21:11 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 23 Feb 2010 21:21:11 +0000 X-ASF-Spam-Status: No, hits=1.2 required=10.0 tests=SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [69.147.107.20] (HELO mrout1-b.corp.re1.yahoo.com) (69.147.107.20) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 23 Feb 2010 21:21:01 +0000 Received: from [10.73.135.251] (wifi-e-135-251.corp.yahoo.com [10.73.135.251]) by mrout1-b.corp.re1.yahoo.com (8.13.8/8.13.8/y.out) with ESMTP id o1NLJHb6045137; Tue, 23 Feb 2010 13:19:17 -0800 (PST) Message-ID: <4B844654.2060406@apache.org> Date: Tue, 23 Feb 2010 13:19:16 -0800 From: Patrick Hunt User-Agent: Thunderbird 2.0.0.23 (X11/20090817) MIME-Version: 1.0 To: zookeeper-user@hadoop.apache.org Subject: Re: Bit of help debugging a TIMED OUT session please References: <7c962aed1002221142q43a9387eo6574be49e1fbbaf3@mail.gmail.com> <7c962aed1002222233m7263147cna4b944e79d248894@mail.gmail.com> In-Reply-To: <7c962aed1002222233m7263147cna4b944e79d248894@mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org Stack you might look at the following: 1) why does server 14 have such a low recv count? Received: 194 while the other servers are at 3.7k + received. Did server 14 fail at some point? Or it's network? This may have caused the timeout seen by the client: ------snippet----- 2010-02-21 18:23:55,583 [main-SendThread] INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server 14.u.XXX.com/X.X.X.141:2181 2010-02-21 18:24:00,423 [regionserver/208.76.44.140:60020.compactor-SendThread] WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x226ed968a270003 to sun.nio.ch.SelectionKeyImpl@2a50e9a3 java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) ----------- 2) connection timeout is different from session timeout. connection timeout is the amount of time we allow for connection establishment (socket open) until the server accepts the connection, this value is the session timeout (as requested by the client) divided by the number of hosts in the host list. This could account for why the timeout (above snippet) occurred after 5 seconds. What timeout value is this client using? 15 seconds? Patrick Stack wrote: > Dang. Didn't save the log. Pardon me. > > I pasted exceptions only and thought it all about 0x26ed968d880001 > session but now I see that what I posted above has TIMED_OUT on > another session altogether. Above I skipped pasting exceptions > thinking them on the same session but now it seems they probably were > not. > > I'm trying to track a case where zk seems of a sudden, client-side, to > give up the ghost w/ exceptions like those pasted above -- > connectivity probs. There has been pollution in here where long gc > pauses that are > session timeout would trigger TIMED_OUT but those > have been tamed. > > I'll be back if I get another instance on my hook. > > Meantime, thanks for the comments. > > St.Ack > > On Mon, Feb 22, 2010 at 6:43 PM, Mahadev Konar wrote: >> HI stack, >> the other interesting part is with the session: >> 0x26ed968d880001 >> >> Looks like it gets disconnected from one of the servers (TIMEOUT). DO you >> see any of these messages: "Attempting connection to server" in the logs >> before you see all the consecutive >> >> org.apache.zookeeper.ClientCnxn: Exception closing session >> 0x26ed968d880001 to sun.nio.ch.SelectionKeyImpl@788ab708 >> java.io.IOException: Read error rc = -1 >> java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] >> at >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) >> at >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) >> >> and.... >> >> >> From the cient 0x26ed968d880001? >> >> Thanks >> mahadev >> >> >> On 2/22/10 11:42 AM, "Stack" wrote: >> >>> The thing that seems odd to me is that the connectivity complaints are >>> out of the zk client, right?, why is it failing getting to member 14 >>> and why not move to another ensemble member if issue w/ 14?, and if >>> there were a general connectivity issue, I'd think that the running >>> hbase cluster would be complaining at about the same time (its talking >>> to datanodes and masters at this time). >>> >>> (Thanks for the input lads) >>> >>> St.Ack >>> >>> >>> On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar wrote: >>>> I also looked at the logs. Ted might have a point. It does look like that >>>> zookeeper server's are doing fine (though as ted mentions the skew is a >>>> little concerning, though that might be due to very few packets served by >>>> the first server). Other than that the latencies of 300 ms at max should not >>>> cause any timeouts. >>>> Also, the number of packets received is pretty low - meaning that it wasn't >>>> serving huge traffic. Is there anyway we can check if the network connection >>>> from the client to the server is not flaky? >>>> >>>> Thanks >>>> mahadev >>>> >>>> >>>> On 2/22/10 10:40 AM, "Ted Dunning" wrote: >>>> >>>>> Not sure this helps at all, but these times are remarkably asymmetrical. I >>>>> would expect members of a ZK cluster to have very comparable times. >>>>> >>>>> Additionally, 345 ms is nowhere near large enough to cause a session to >>>>> expire. My take is that ZK doesn't think it caused the timeout. >>>>> >>>>> On Mon, Feb 22, 2010 at 10:18 AM, Stack wrote: >>>>> >>>>>> Latency min/avg/max: 2/125/345 >>>>>> ... >>>>>> Latency min/avg/max: 0/7/81 >>>>>> ... >>>>>> Latency min/avg/max: 1/1/1 >>>>>> >>>>>> Thanks for any pointers on how to debug. >>>>>> >>>> >>