cassandra-commits mailing list archives

From "Peter Haggerty (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6125) Race condition in Gossip propagation
Date Wed, 22 Jul 2015 02:49:05 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636139#comment-14636139 ]

Peter Haggerty commented on CASSANDRA-6125:
-------------------------------------------

We've seen this bug, or something like it, on 2.0.11 with 45 nodes in a fairly noisy AWS environment,
but other than CASSANDRA-8336 I don't see any fixes to gossip post 2.0.11.

The nodetool status command doesn't list the node that doesn't have status info. It's not up
or down; it's simply not there, and this impacts % ownership.
In a recent instance of this, 4 nodes had the same "status hole", but only 2 of the 4 had
nodetool ring output that differed from the other 41 "no status hole" members of the ring.

Restarting cassandra on the node that has a missing STATUS entry in gossip "fixes" the problem
in that the hole goes away. We used to see this more commonly before 2.0.11, so the fix does
appear to work, but are there other places where a race might be happening?

{code}
/10.xx.yyy.169
  generation:1436544814
  heartbeat:2986679
  SEVERITY:0.0
  HOST_ID:7d22299f-b35b-4035-82bc-e2b603a655d7
  LOAD:2.555557836E11
  RACK:1e
  NET_VERSION:7
  DC:us-east
  RPC_ADDRESS:10.xx.yyy.169
  RELEASE_VERSION:2.0.11
  SCHEMA:0f72be52-2751-33a6-a172-8511e943b2ec
/10.xx.yyy.175
  generation:1419877470
  heartbeat:53496976
  SEVERITY:1.2787723541259766
  HOST_ID:c87ed8db-76b6-485a-ac2f-32c2822b1ef5
  LOAD:3.08812188602E11
  RACK:1e
  NET_VERSION:7
  STATUS:NORMAL,-1010822684895662807
  DC:us-east
  RPC_ADDRESS:10.xx.yyy.175
  RELEASE_VERSION:2.0.11
  SCHEMA:0f72be52-2751-33a6-a172-8511e943b2ec
{code}


> Race condition in Gossip propagation
> ------------------------------------
>
>                 Key: CASSANDRA-6125
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6125
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sergio Bossa
>            Assignee: Brandon Williams
>             Fix For: 2.0.11, 2.1.1
>
>         Attachments: 6125.txt
>
>
> Gossip propagation has a race when concurrent VersionedValues are created and submitted/propagated,
causing some updates to be lost, even if they happen on different ApplicationStates.
> That's what happens basically:
> 1) A new VersionedValue V1 is created with version X.
> 2) A new VersionedValue V2 is created with version Y = X + 1.
> 3) V2 is added to the endpoint state map and propagated.
> 4) Nodes register Y as max version seen.
> 5) At this point, V1 is added to the endpoint state map and propagated too.
> 6) V1's version is X < Y, so nodes do not ask for its value after digests.
> A possible solution would be to propagate/track per-ApplicationStatus versions, possibly
encoding them to avoid network overhead.
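> To make the lost-update sequence concrete, here is a minimal, self-contained sketch of the race (an assumption-laden toy model, not Cassandra's actual Gossiper code): node B keeps a single per-endpoint max version and discards any update whose version is below it, so an update that was created earlier but propagated later is silently dropped. The class and field names are hypothetical.

{code}
import java.util.HashMap;
import java.util.Map;

// Toy model of the digest race in steps 1-6 above; not the real Gossiper.
public class GossipRaceSketch {
    // app-state name -> version, on the originating node A and the peer B
    static Map<String, Integer> nodeAState = new HashMap<>();
    static Map<String, Integer> nodeBState = new HashMap<>();
    // B tracks one max version per endpoint -- this is the flaw the issue describes
    static int nodeBMaxVersionSeen = 0;

    // B accepts an update only if its version exceeds the max it has seen
    // for the whole endpoint; otherwise the digest says "nothing newer".
    static void propagate(String appState, int version) {
        nodeAState.put(appState, version);
        if (version > nodeBMaxVersionSeen) {
            nodeBState.put(appState, version);
            nodeBMaxVersionSeen = version;
        } // else: update silently dropped, as in step 6
    }

    public static void main(String[] args) {
        int x = 10;              // V1 created with version X
        int y = x + 1;           // V2 created with version Y = X + 1
        propagate("STATUS", y);  // V2 propagated first; peers record max = Y
        propagate("LOAD", x);    // V1 propagated second; X < Y, so B never asks for it
        System.out.println("node A: " + nodeAState);
        System.out.println("node B: " + nodeBState); // LOAD missing on B
    }
}
{code}

> Tracking versions per ApplicationState, as the description proposes, would make the `version > nodeBMaxVersionSeen` comparison apply only within each state, so the late-arriving LOAD update would no longer be shadowed by the newer STATUS version.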



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
