cassandra-commits mailing list archives

From "Jeremy Hanna (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6082) 1.1.12 --> 1.2.x upgrade may result in an inconsistent ring
Date Wed, 11 Dec 2013 17:27:07 GMT


Jeremy Hanna commented on CASSANDRA-6082:

It looks like it may be due to stale entries for older, long-departed nodes still present in the gossipinfo, for example:

[ldc02] out: /
[ldc02] out:   REMOVAL_COORDINATOR:REMOVER,159507359494189904748456847233641349120
[ldc02] out:   SCHEMA:977aba47-abf7-11e1-9d47-3f10d3dde90f
[ldc02] out:   LOAD:44182.0
[ldc02] out:   STATUS:removed,85070591730234615865843651857942052864
[ldc02] out:   RELEASE_VERSION:0.7.10

Older versions didn't TTL the gossip info, so it may just be a matter of using the assassinate
operation to get rid of the older nodes: invoke unsafeAssassinateEndpoint on the Gossiper MBean
via JMX, passing the IP address of each stale node.
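A minimal Java JMX client sketch of that call, assuming Cassandra's default JMX port 7199 and a hypothetical stale-node address; org.apache.cassandra.net:type=Gossiper is the name Cassandra registers the Gossiper MBean under:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class AssassinateStaleNode {
    // Cassandra registers its Gossiper MBean under this name.
    static final String GOSSIPER = "org.apache.cassandra.net:type=Gossiper";

    public static void main(String[] args) throws Exception {
        String jmxHost = args.length > 0 ? args[0] : "localhost"; // any live node
        String victim  = args.length > 1 ? args[1] : "10.0.0.1";  // hypothetical stale IP

        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + jmxHost + ":7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            // unsafeAssassinateEndpoint(String) drops the endpoint's gossip state.
            mbsc.invoke(new ObjectName(GOSSIPER),
                        "unsafeAssassinateEndpoint",
                        new Object[] { victim },
                        new String[] { "java.lang.String" });
        }
    }
}
```

Run it once per stale IP found in gossipinfo; newer Cassandra releases also expose this operation as nodetool assassinate.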

> 1.1.12 --> 1.2.x upgrade may result in an inconsistent ring
> -----------------------------------------------------------
>                 Key: CASSANDRA-6082
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 1.1.12 --> 1.2.9
>            Reporter: Chris Burroughs
>            Priority: Minor
>         Attachments: c-gossipinfo, c-status
> This happened to me once, and since I don't have any more 1.1.x clusters I won't be testing
again.  I hope the attached files are enough for someone to connect the dots.
> I did a rolling restart to upgrade from 1.1.12 --> 1.2.9.  About a week later I discovered
that one node was in an inconsistent state in the ring.  It was either:
>  * up
>  * host-id=null
>  * missing
> depending on which node I ran nodetool status from.  I *think* I just missed this during
the upgrade but cannot rule out the possibility that it "just happened for no reason" some
time after the upgrade.  It was detected when running repair in such a ring caused all sorts
of terrible data "duplication" and performance tanked.  Restarting the seeds + the "bad" node
made the ring consistent again.
> Two possibly suspicious things are an ArrayIndexOutOfBoundsException on startup:
> {noformat}
> ERROR [GossipStage:1] 2013-09-06 10:45:35,213 (line 194) Exception
in thread Thread[GossipStage:1,5,main]
> java.lang.ArrayIndexOutOfBoundsException: 2
>         at org.apache.cassandra.service.StorageService.extractExpireTime(
>         at org.apache.cassandra.service.StorageService.handleStateRemoving(
>         at org.apache.cassandra.service.StorageService.onChange(
>         at org.apache.cassandra.service.StorageService.onJoin(
>         at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(
>         at org.apache.cassandra.gms.Gossiper.applyStateLocally(
>         at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(
>         at
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
>         at java.util.concurrent.ThreadPoolExecutor$
>         at
> {noformat}
> and problems with hint delivery to multiple nodes:
> {noformat}
> ERROR [MutationStage:11] 2013-09-06 13:59:19,604 (line 194) Exception
in thread Thread[MutationStage:11,5,main]
> java.lang.AssertionError: Missing host ID for
>         at org.apache.cassandra.service.StorageProxy.writeHintForMutation(
>         at org.apache.cassandra.service.StorageProxy$5.runMayThrow(
>         at org.apache.cassandra.service.StorageProxy$
>         at java.util.concurrent.Executors$
>         at java.util.concurrent.FutureTask$Sync.innerRun(
>         at
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
>         at java.util.concurrent.ThreadPoolExecutor$
>         at
> {noformat}
> Note, however, that while there were delivery problems to multiple nodes during the rolling
upgrade, only one node was in a funky state a week later.
> Attached are the results of running gossipinfo and status on every node.

This message was sent by Atlassian JIRA
