cassandra-commits mailing list archives

From "Brandon Williams (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (CASSANDRA-3533) TimeoutException when there is a firewall issue.
Date Fri, 05 Apr 2013 12:42:16 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams reopened CASSANDRA-3533:
-----------------------------------------


Something's wrong here, because I'm randomly seeing these in the dtests:

{noformat}
 INFO [main] 2013-04-05 04:53:22,574 ThriftServer.java (line 90) Binding thrift service to /127.0.0.2:9160
 INFO [main] 2013-04-05 04:53:22,622 ThriftServer.java (line 102) Using TFramedTransport with a max frame size of 15728640 bytes.
ERROR [GossipStage:1] 2013-04-05 04:53:23,048 CassandraDaemon.java (line 179) Exception in thread Thread[GossipStage:1,5,main]
java.lang.AssertionError
    at org.apache.cassandra.service.EchoVerbHandler.doVerb(EchoVerbHandler.java:17)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
{noformat}
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Vijay
>            Priority: Minor
>             Fix For: 2.0
>
>         Attachments: 0001-CASSANDRA-3533.patch, 3533.txt
>
>
> When one node in the cluster cannot talk to the other DC/RAC because of a firewall or network issue (StorageProxy calls fail), and the nodes are NOT marked down because at least one node in the cluster can still talk to the other DC/RAC, we get a TimeoutException instead of an UnavailableException.
> The problems with this:
> 1) It is hard to monitor/identify these errors.
> 2) It is hard for the client to differentiate between a bad node and a bad query.
> 3) When this issue happens, we have to wait at least the RPC timeout to know that the query won't succeed.
> Possible solution: when marking a node down, we might want to check whether the node is actually alive by trying to communicate with it, so we can be sure of its real state.
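
The idea in the description can be illustrated with a minimal sketch, assuming each node keeps its own record of which endpoints it has confirmed by a direct round trip, separately from what gossip reports. The class and method names below (LocalReachability, onGossipAlive, onEchoReply, assureSufficientReplicas) are hypothetical and are not taken from Cassandra or from the attached patches; the sketch only shows how a coordinator could fail fast instead of waiting for the RPC timeout.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only: none of these names are Cassandra APIs.
class LocalReachability {
    // endpoint -> true once THIS node has completed a direct round trip to it
    private final Map<String, Boolean> confirmed = new ConcurrentHashMap<>();

    // Gossip reports the endpoint as up; keep it unconfirmed until we hear back directly.
    public void onGossipAlive(String endpoint) {
        confirmed.putIfAbsent(endpoint, false);
    }

    // A direct echo request to the endpoint was answered.
    public void onEchoReply(String endpoint) {
        confirmed.put(endpoint, true);
    }

    // A direct echo request failed or was never answered.
    public void onEchoFailure(String endpoint) {
        confirmed.put(endpoint, false);
    }

    public boolean isReachable(String endpoint) {
        return confirmed.getOrDefault(endpoint, false);
    }

    // Fail immediately when too few replicas are reachable from this node,
    // rather than sending the request and waiting for the RPC timeout.
    public void assureSufficientReplicas(List<String> replicas, int required) {
        long live = replicas.stream().filter(this::isReachable).count();
        if (live < required) {
            throw new IllegalStateException(
                "unavailable: " + live + " of " + required + " required replicas reachable");
        }
    }
}
```

In this sketch the IllegalStateException stands in for an unavailable error returned to the client. A node behind a misconfigured firewall would never receive echo replies from the other DC/RAC, so it would reject the query up front even though gossip from other nodes still reports those endpoints as alive.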

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
