cassandra-commits mailing list archives

From "Thibaut (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2394) Faulty hd kills cluster performance
Date Wed, 13 Apr 2011 08:41:05 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019251#comment-13019251 ]

Thibaut commented on CASSANDRA-2394:
------------------------------------



In our case, the degradation never stopped. It didn't matter whether we connected to the
faulty node itself or to another node in the cluster. As soon as we killed the offending node,
cluster performance returned to normal. We also use a custom Hector load-balancing policy
that always prefers the local node before trying another node.

I'm not sure what you mean by coordinator nodes. That cluster had 20 nodes and a replication
factor of 3.

I will look into it more closely when we have a similar problem in the future.



> Faulty hd kills cluster performance
> -----------------------------------
>
>                 Key: CASSANDRA-2394
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2394
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Thibaut
>            Priority: Minor
>             Fix For: 0.7.5
>
>
> Hi,
> About every week, a node from our main cluster (>100 nodes) has a faulty hd (listing
> the cassandra data storage directory triggers an input/output error).
> Whenever this occurs, I see many timeout exceptions in our application on various nodes,
> which cause everything to run very slowly. Key range scans just time out and sometimes
> never succeed. If I stop cassandra on the faulty node, everything runs normally again.
> It would be great to have some kind of monitoring thread in cassandra which marks a node
> as "down" if there are multiple read/write errors to the data directories. A single faulty
> hd on 1 node shouldn't affect global cluster performance.
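
The monitoring idea in the description above could look roughly like the following. This is
only a sketch under stated assumptions, not Cassandra code: the class name DiskErrorMonitor,
the threshold and window parameters, and the idea of counting errors in a sliding time
window are all hypothetical. How a node would actually be marked down (for example by
stopping its gossip or client services) is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch: counts data-directory I/O errors in a sliding time
 * window and flags the node as "down" once a threshold is reached.
 * Not part of Cassandra; names and parameters are assumptions.
 */
public class DiskErrorMonitor {
    private final int maxErrors;       // errors tolerated within the window
    private final long windowMillis;   // length of the sliding window
    private final Deque<Long> errorTimes = new ArrayDeque<>();
    private boolean markedDown = false;

    public DiskErrorMonitor(int maxErrors, long windowMillis) {
        this.maxErrors = maxErrors;
        this.windowMillis = windowMillis;
    }

    /** Call whenever a read or write to a data directory fails. */
    public synchronized void recordError(long nowMillis) {
        errorTimes.addLast(nowMillis);
        // Drop errors that have fallen out of the window.
        while (!errorTimes.isEmpty()
                && nowMillis - errorTimes.peekFirst() > windowMillis) {
            errorTimes.removeFirst();
        }
        // Once tripped, the flag stays set until an operator intervenes.
        if (errorTimes.size() >= maxErrors) {
            markedDown = true;
        }
    }

    public synchronized boolean isMarkedDown() {
        return markedDown;
    }
}
```

A caller would invoke recordError(System.currentTimeMillis()) from wherever an IOException
on the data directory is caught, and check isMarkedDown() before accepting further requests.
Using a window rather than a lifetime counter keeps a single transient error from taking the
node out of the cluster.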

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
