cassandra-commits mailing list archives

From "Jonathan Ellis (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes
Date Sun, 02 Oct 2011 20:03:35 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119054#comment-13119054 ]

Jonathan Ellis commented on CASSANDRA-3294:
-------------------------------------------

What do you suggest?  TCP connection death isn't synonymous with process death.
                
> a node whose TCP connection is not up should be considered down for the purpose of reads
and writes
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3294
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3294
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>
> Cassandra fails to handle the most simple of cases intelligently - a process gets killed
and the TCP connection dies. I cannot see a good reason to wait for a bunch of RPC timeouts
and thousands of hung requests to realize that we shouldn't be sending messages to a node
when the only possible means of communication is confirmed down. This is why one has to "disablegossip
and wait for a while" to restart a node on a busy cluster (especially without CASSANDRA-2540,
but that only helps under certain circumstances).
> A more generalized approach, whereby one e.g. weighs in the number of currently outstanding
RPC requests to a node, would likely take care of this case as well. But until such a thing
exists and works well, it seems prudent to have the very common and controlled form of "failure"
be handled better.
> Are there difficulties I'm not seeing?
> I can see that one may want to distinguish considering something "really down"
(and e.g. failing a repair because it's down) from what I'm talking about, so maybe there are
different concepts (say one is "currently unreachable" rather than "down") being conflated.
But in the specific case of sending reads/writes to a node we *know* we cannot talk to, continuing
to do so seems unnecessarily detrimental.
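
To make the distinction in the description concrete, here is a minimal sketch in Python. It is not Cassandra code and not a proposed patch. The names Peer, pick_targets and is_down_for_repair are hypothetical, invented for illustration only. It assumes each peer carries three independent pieces of state: what the failure detector believes, whether the TCP connection is currently up, and how many RPC requests are outstanding.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """Hypothetical per-node state; the field names are assumptions."""
    name: str
    gossip_alive: bool = True   # what a gossip-style failure detector believes
    connected: bool = True      # whether the TCP connection to the peer is up
    outstanding: int = 0        # RPC requests sent but not yet answered


def pick_targets(replicas, needed):
    """Return up to `needed` replicas to send a read or write to.

    Peers whose connection is down are skipped immediately instead of
    waiting for RPC timeouts ("currently unreachable"). The remaining
    peers are ordered by their number of outstanding requests.
    """
    reachable = [p for p in replicas if p.gossip_alive and p.connected]
    reachable.sort(key=lambda p: p.outstanding)
    return reachable[:needed]


def is_down_for_repair(peer):
    """'Really down' remains a failure-detector decision, not a socket one."""
    return not peer.gossip_alive
```

In this sketch a dead connection only removes a node from read/write routing. It does not mark the node "down" for operations such as repair, which matches the point that TCP connection death is not synonymous with process death.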

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
