cassandra-commits mailing list archives

From "Jason Brown (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9100) Gossip is inadequately tested
Date Fri, 24 Apr 2015 21:58:39 GMT


Jason Brown commented on CASSANDRA-9100:

TL;DR I think some dtests/ccm are the way to go for now.

Last summer, I built a simulator for our gossip so I could understand it further and see where
it starts to break down. It took me about 2.5 weeks just to pull apart the gossip components
from the rest of the system so I could run them in isolation - meaning, have more than one
Gossiper executing in a single JVM. The changes included a series of hacks that broke many other
components, like MessagingService (but that was acceptable for the simulator), and I'm not
sure the rest of cassandra was totally legit with the hacks, either (except Gossiper, of course).
I did have a workable simulator after the effort, but didn't have much time to invest in it
beyond that (maybe some prep work for my various gossip talks).

This being said, I think it's an incredibly non-trivial effort to tease gossip out for testing
due to all the singletons, as [~brandon.williams] mentioned. I think some good wins, however,
could be gained by adding in some dtests - but then, the question is "what to monitor for
indications of success/failure?". I'm not sure there's a fantastic answer here. The (limited)
possibilities include nodetool output, log file scraping, and ... ? I'd lean toward
nodetool output, but we already scrape log files in dtests (I think), so that's not without
precedent; it also depends on what is being tested.
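
A rough sketch of the nodetool route, in Python since that is what dtests are written in. It assumes the usual `nodetool status` layout, where each node row starts with a two-letter code (Up/Down, then Normal/Leaving/Joining/Moving) followed by the address. The helper names and the sample output are made up for illustration, and the exact columns can differ between versions, so treat this as a starting point rather than a finished check.

```python
import re

# A node row looks like "UN  127.0.0.1  ...": first letter Up/Down,
# second letter Normal/Leaving/Joining/Moving, then the node address.
NODE_ROW = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def parse_status(output):
    """Map each node address to its two-letter status code (hypothetical helper)."""
    states = {}
    for line in output.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            states[match.group(3)] = match.group(1) + match.group(2)
    return states


def nodes_seen_down(output):
    """Return the sorted addresses that this node's view marks as Down."""
    return sorted(addr for addr, code in parse_status(output).items()
                  if code.startswith("D"))


# Hand-written sample in the general shape of `nodetool status` output;
# the loads, ownership figures and host IDs are placeholders, not real data.
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID   Rack
UN  127.0.0.1  47.66 KB  1       33.3%  (elided)  rack1
DN  127.0.0.2  51.02 KB  1       33.3%  (elided)  rack1
UN  127.0.0.3  49.10 KB  1       33.4%  (elided)  rack1
"""
```

A test could run nodetool against each node in turn and compare the views, e.g. assert that `nodes_seen_down(...)` eventually agrees on every live node after one node is stopped.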

Thinking on it more, and, if it's even possible, it might be neat to script some iptables
manipulation into dtests to block IPs/ports from communicating, then observe that gossip behaves
as expected. Think of it as "mini-Jepsen", and testing gossip in the face of network partitions
seems like an apropos place for that kind of testing.
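
To make the iptables idea a bit more concrete, here is a minimal sketch of how a test might build the rules. It assumes the nodes talk on Cassandra's default storage port (7000), that each node has its own IP (as with ccm's loopback addresses), and that the test process is allowed to run iptables. The function names are hypothetical and nothing like this is claimed to exist in dtests.

```python
import subprocess

STORAGE_PORT = 7000  # Cassandra's default inter-node port; an assumption about the cluster config


def block_rule(src_ip, dst_ip, port=STORAGE_PORT, action="-A"):
    """Build an iptables command dropping TCP packets sent from src_ip to dst_ip:port.

    action="-A" appends the rule; action="-D" deletes the same rule again.
    """
    return ["iptables", action, "INPUT",
            "-s", src_ip, "-d", dst_ip,
            "-p", "tcp", "--dport", str(port),
            "-j", "DROP"]


def partition_rules(ip_a, ip_b, port=STORAGE_PORT, action="-A"):
    """Rules for blocking both directions between two nodes."""
    return [block_rule(ip_a, ip_b, port, action),
            block_rule(ip_b, ip_a, port, action)]


def apply_rules(rules, run=subprocess.check_call):
    """Execute each rule; `run` is injectable so the logic can be checked without root."""
    for rule in rules:
        run(rule)
```

Using a single `block_rule` gives a one-way block, `partition_rules` a two-way one; calling either again with `action="-D"` heals the partition, after which the test can check that gossip converges.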

> Gossip is inadequately tested
> -----------------------------
>                 Key: CASSANDRA-9100
>                 URL:
>             Project: Cassandra
>          Issue Type: Test
>          Components: Core
>            Reporter: Ariel Weisberg
> We found a few unit tests, but nothing that exercises Gossip under challenging conditions.
> Maybe consider a long test that hooks up some gossipers over a fake network and then do fault
> injection on that fake network. Uni-directional and bi-directional partitions, delayed delivery,
> out of order delivery if that is something that they can see in practice. Connects/disconnects.
> Also play with bad clocks.

This message was sent by Atlassian JIRA
