From: Jonathan Ellis <jbellis@gmail.com>
Date: Thu, 20 May 2010 17:04:39 -0700
Message-ID: <AANLkTikLzn56MWtpgK-_jrESU11jSKBlT0dfdpAD7dkW@mail.gmail.com>
Subject: Re: Ring out of sync, cassandra_UnavailableException being thrown
To: user@cassandra.apache.org, keith@raptr.com

Were you bootstrapping or otherwise moving nodes around?

I don't think anyone has tracked this bug down any further than "if you
restart the entire cluster, it goes away."

On Wed, May 19, 2010 at 10:05 PM, Keith Thornhill <keith@raptr.com> wrote:
> in a 5 node cluster, i noticed in our client error log that one of the
> nodes was consistently throwing cassandra_UnavailableException during
> a read operation.
>
> looking into jmx, it was obvious that one of the node's view of the
> ring was out of sync.
>
> $ nodetool -host 192.168.20.150 ring
> Address         Status  Load      Range                                      Ring
>                                   139508497374977076191526400448759597506
> 192.168.20.156  Up      5.73 GB   733665530305941485083898696792520436       |<--|
> 192.168.20.158  Up      3.41 GB   9629533262984150011756238989685472219      |   ^
> 192.168.20.154  Up      2.44 GB   31048334058970902242412812423471654868     v   |
> 192.168.20.150  Up      4.89 GB   105769574715070648260922426249777160699    |   ^
> 192.168.20.152  Up      5.24 GB   139508497374977076191526400448759597506    |-->|
>
> $ nodetool -host 192.168.20.158 ring
> Address         Status  Load      Range                                      Ring
> 192.168.20.158  Up      3.41 GB   9629533262984150011756238989685472219      |<--|
>
> looking at the CF stats on that node, it is obvious that reads and
> writes are happening, but i have to assume that those are coming from
> proxy connections via the other nodes.
>
> when restarting that node, the error logs in the other cluster nodes
> show that they detect the server going away and then coming back into
> the ring.
>
> INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:39,448 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
> INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:55,475 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
> INFO [GMFD:1] 2010-05-19 21:27:56,481 Gossiper.java (line 582) Node /192.168.20.158 has restarted, now UP again
> INFO [GMFD:1] 2010-05-19 21:27:56,482 StorageService.java (line 538) Node /192.168.20.158 state jump to normal
>
> any ideas on how to kick that node and remind it of its buddies?
>
> thanks!
> -keith
>
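
For spotting this kind of mismatch without comparing the output by eye,
here is a rough sketch (not an official tool) that collects each node's
view with the same `nodetool -host <host> ring` command quoted above and
reports which nodes are missing peers. The function names, the regular
expression, and the "Down" status value are assumptions made for
illustration; the parsing is only based on the output format shown in
this thread.

```python
import re
import subprocess

# Matches a ring row such as "192.168.20.156Up   5.73 GB ...". The address and
# status may be fused, as in the output quoted in this thread. "Down" is an
# assumed alternative status value.
ROW = re.compile(r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s*(Up|Down)\b")


def parse_ring(output):
    """Return the set of node addresses listed in one `nodetool ring` output."""
    members = set()
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            members.add(match.group(1))
    return members


def ring_view(host):
    """Run `nodetool -host <host> ring` and parse it (needs nodetool on PATH)."""
    result = subprocess.run(
        ["nodetool", "-host", host, "ring"],
        capture_output=True, text=True, check=True,
    )
    return parse_ring(result.stdout)


def out_of_sync(views):
    """Map each host to the addresses missing from its view of the ring.

    `views` maps host -> set of addresses that host reports. The union of all
    views is treated as the full ring; hosts that see every node are omitted.
    """
    full_ring = set().union(*views.values())
    return {
        host: sorted(full_ring - seen)
        for host, seen in views.items()
        if seen != full_ring
    }
```

Something like `out_of_sync({h: ring_view(h) for h in hosts})`, with
`hosts` being the list of node addresses, would then flag any node whose
ring is missing members. It only detects the problem; it does not repair
the node's view.
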


-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com