From: Jonathan Ellis
Date: Thu, 20 May 2010 17:04:39 -0700
Subject: Re: Ring out of sync, cassandra_UnavailableException being thrown
To: user@cassandra.apache.org, keith@raptr.com

Were you bootstrapping or otherwise moving nodes around?

I don't think anyone's tracked this bug down farther than "if you
restart the entire cluster, it goes away."

On Wed, May 19, 2010 at 10:05 PM, Keith Thornhill wrote:
> in a 5 node cluster, i noticed in our client error log that one of the
> nodes was consistently throwing cassandra_UnavailableException during
> a read operation.
>
> looking into jmx, it was obvious that one of the node's view of the
> ring was out of sync.
>
> $ nodetool -host 192.168.20.150 ring
> Address       Status     Load          Range                                      Ring
>                                        139508497374977076191526400448759597506
> 192.168.20.156Up         5.73 GB       733665530305941485083898696792520436       |<--|
> 192.168.20.158Up         3.41 GB       9629533262984150011756238989685472219      |   ^
> 192.168.20.154Up         2.44 GB       31048334058970902242412812423471654868     v   |
> 192.168.20.150Up         4.89 GB       105769574715070648260922426249777160699    |   ^
> 192.168.20.152Up         5.24 GB       139508497374977076191526400448759597506    |-->|
>
> $ nodetool -host 192.168.20.158 ring
> Address       Status     Load          Range                                      Ring
> 192.168.20.158Up         3.41 GB       9629533262984150011756238989685472219      |<--|
>
> looking at the CF stats on that node, it is obvious that reads and
> writes are happening, but i have to assume that those are coming from
> proxy connections via the other nodes.
>
> when restarting that node, the error logs in the other cluster nodes
> show that they detect the server going away and then coming back into
> the ring.
>
> INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:39,448 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
> INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:55,475 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
> INFO [GMFD:1] 2010-05-19 21:27:56,481 Gossiper.java (line 582) Node /192.168.20.158 has restarted, now UP again
> INFO [GMFD:1] 2010-05-19 21:27:56,482 StorageService.java (line 538) Node /192.168.20.158 state jump to normal
>
> any ideas on how to kick that node and remind it of its buddies?
>
> thanks!
> -keith
>

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
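
The manual check quoted above (run `nodetool -host <ip> ring` against each node and compare what each one reports) could be scripted roughly as follows. This is only a sketch, not anything shipped with Cassandra: `parse_ring` and `missing_peers` are hypothetical helper names, the parser assumes the column layout shown in the quoted output (address immediately followed by the status), and treating "Down" as the other possible status value is an assumption.

```python
import re

# An IPv4 address at the start of a row. In the quoted output the status
# follows the address with no separator, so the whitespace is optional.
# "Down" as the alternative to "Up" is an assumption.
ROW = re.compile(r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s*(Up|Down)\b")


def parse_ring(output):
    """Return {address: status} for each node row in one `nodetool ring` dump."""
    nodes = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            nodes[match.group(1)] = match.group(2)
    return nodes


def missing_peers(views):
    """views maps each queried host to its parsed ring.

    Returns {host: sorted addresses that some other host reports but this
    host does not}. Hosts whose view is complete are left out.
    """
    everyone = set()
    for ring in views.values():
        everyone.update(ring)
    result = {}
    for host, ring in views.items():
        absent = everyone - set(ring)
        if absent:
            result[host] = sorted(absent)
    return result


if __name__ == "__main__":
    # Two rows from each of the ring dumps quoted in the message.
    view_150 = (
        "Address       Status     Load          Range\n"
        "192.168.20.156Up         5.73 GB       733665530305941485083898696792520436\n"
        "192.168.20.158Up         3.41 GB       9629533262984150011756238989685472219\n"
    )
    view_158 = (
        "Address       Status     Load          Range\n"
        "192.168.20.158Up         3.41 GB       9629533262984150011756238989685472219\n"
    )
    views = {
        "192.168.20.150": parse_ring(view_150),
        "192.168.20.158": parse_ring(view_158),
    }
    for host, absent in missing_peers(views).items():
        print(host, "does not see:", ", ".join(absent))
```

Feeding it the captured output from every node would flag 192.168.20.158 as the one with the partial view, which is the condition described in the original report. It only detects the divergence; it does nothing to repair it.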