cassandra-commits mailing list archives

From "Peter Schuller (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3910) make phi_convict_threshold Float
Date Sat, 18 Feb 2012 07:03:59 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210854#comment-13210854 ]

Peter Schuller commented on CASSANDRA-3910:
-------------------------------------------

So a phi of 8-9 or so should result in a node being marked down after something like 10+
seconds (off the top of my head; the point is "several seconds"), assuming gossip delays
dominate the heartbeat propagation (as opposed to networking issues, in which case it would
take longer).
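To make the relation between phi and wall-clock time concrete, here is a minimal sketch. It assumes heartbeat inter-arrival times follow an exponential distribution, so that phi is the negative base-10 logarithm of the probability of a silence that long. The class name, method names, and the 1000 ms mean interval used in the note below are illustrative assumptions, not Cassandra's actual source or measured values.

```java
// Sketch only: an exponential-model phi calculation with hypothetical names.
final class PhiSketch {
    // Converts a natural-log exponent to base 10: 1 / ln(10).
    static final double PHI_FACTOR = 1.0 / Math.log(10.0);

    /** phi = -log10(exp(-sinceLastMs / meanIntervalMs)). */
    static double phi(double sinceLastMs, double meanIntervalMs) {
        return PHI_FACTOR * sinceLastMs / meanIntervalMs;
    }

    /** Milliseconds of silence needed before phi reaches the given threshold. */
    static double msUntilConvict(double threshold, double meanIntervalMs) {
        return threshold * meanIntervalMs / PHI_FACTOR;
    }
}
```

Under these assumptions, `PhiSketch.msUntilConvict(8.0, 1000.0)` comes out at roughly 18.4 seconds and a threshold of 9 at roughly 20.7 seconds, so a fractional threshold such as 8.5 would land in between. The real figure depends on the observed mean heartbeat interval.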

Questions:

* If you just have these two nodes sit idle without traffic, do you see hosts being kicked
into the down state spontaneously?
** If yes, something is either buggy or your network conditions are *extremely* poor. I presume
the answer to this is "no", but I wanted to ask to be sure.

* Under traffic conditions when you are observing this flapping, how much data are you pushing
between these two nodes? Are you throwing traffic "as fast as possible" (un-throttled benchmark
client which isn't bottlenecking) or at some pre-set pace? What is the actual bandwidth and
how does it relate to the expected throughput on a TCP connection between the two nodes?

I am mostly trying to confirm what's going on. It sounds to me like you're likely shoving
more down that TCP pipe than it can reliably sustain on average, so that after a hiccup
on the TCP connection enough traffic is queued up that the delay in gossip is simply the
time it takes to work through the backlog of requests.
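As a rough back-of-the-envelope sketch of that effect: if gossip shares a FIFO connection with request traffic, a gossip message waits for everything queued ahead of it to drain. The class name and the example figures below are hypothetical and chosen only for illustration.

```java
// Sketch only: how long a message waits behind a backlog on a FIFO link.
final class QueueDelaySketch {
    /** Seconds needed to drain backlogBytes at a sustained rate of bytesPerSecond. */
    static double fifoDelaySeconds(long backlogBytes, double bytesPerSecond) {
        return backlogBytes / bytesPerSecond;
    }
}
```

For instance, `QueueDelaySketch.fifoDelaySeconds(50_000_000L, 2_500_000.0)` gives 20 seconds, which is already in the range where a phi threshold of 8-9 would be expected to convict the peer.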

It strikes me that "invalid" downs due to this would be most effectively solved by having
Gossip messages be prioritized (or even just sent on a separate connection, but that would be
more work patch-wise) when enqueued on the TCP connection. If they are always prioritized, you
wouldn't see delays in gossip messages other than those due to networking conditions so bad
that not even that tiny bit of information is making it through. (This only makes sense though
if you don't expect the failure detector to help with congestion.)
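A minimal sketch of what prioritizing gossip on the outbound queue could look like, using `java.util.concurrent.PriorityBlockingQueue`. This is not how Cassandra's outbound connection is implemented; the class, the `Kind` enum, and the sequence-number tie-breaker are all assumptions made for illustration.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: gossip messages overtake queued requests; FIFO order within a kind.
final class PrioritizedOutboundQueue {
    // Declaration order doubles as priority: GOSSIP sorts before REQUEST.
    enum Kind { GOSSIP, REQUEST }

    private static final class Message implements Comparable<Message> {
        final Kind kind;
        final String payload;
        final long seq; // arrival order, used to keep FIFO order within a kind

        Message(Kind kind, String payload, long seq) {
            this.kind = kind;
            this.payload = payload;
            this.seq = seq;
        }

        @Override
        public int compareTo(Message other) {
            int byKind = kind.compareTo(other.kind);
            return byKind != 0 ? byKind : Long.compare(seq, other.seq);
        }
    }

    private final PriorityBlockingQueue<Message> queue = new PriorityBlockingQueue<>();
    private final AtomicLong nextSeq = new AtomicLong();

    void enqueue(Kind kind, String payload) {
        queue.put(new Message(kind, payload, nextSeq.getAndIncrement()));
    }

    /** Returns the next payload to write to the socket, or null if nothing is queued. */
    String pollPayload() {
        Message m = queue.poll();
        return m == null ? null : m.payload;
    }
}
```

With this ordering, a gossip message enqueued behind a large backlog of requests is the next one written, so its delay reflects only the message currently on the wire and the network itself.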

But, since you're also trying to use the FD to avoid queueing up messages, it doesn't actually
solve *your* problem. And if you didn't care about that, you could just up the phi conviction
threshold even more until you don't see flapping. That's assuming the overall average bandwidth
is high enough to sustain your traffic pattern.

                
> make phi_convict_threshold Float
> --------------------------------
>
>                 Key: CASSANDRA-3910
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3910
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.0.7
>            Reporter: Radim Kolar
>
> I would like phi_convict_threshold to be a floating-point number instead of an integer.
Value 8 is too low for me and value 9 is too high. As a floating-point value, it could be
tuned more finely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
