cassandra-commits mailing list archives

From "Thibaut (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2394) Faulty hd kills cluster performance
Date Tue, 03 May 2011 13:46:03 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028227#comment-13028227 ]

Thibaut commented on CASSANDRA-2394:
------------------------------------

I have the same problem again (with dynamic snitch enabled this time). The cluster won't respond
to any queries anymore.

Very few Commands and Responses are being processed, and there are no exceptions in the log.
Kern.log is full of hd read errors.

root@intr2n18:~# /software/cassandra/bin/nodetool -h localhost netstats
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a         0        4593983
Responses                       n/a         0        5276499

Is it possible to port this patch back to 0.7? Certainly everybody running Cassandra on bigger
clusters with non-RAID hds is affected by this.

> Faulty hd kills cluster performance
> -----------------------------------
>
>                 Key: CASSANDRA-2394
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2394
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Thibaut
>            Priority: Minor
>             Fix For: 0.7.6
>
>
> Hi,
> About every week, a node from our main cluster (>100 nodes) has a faulty hd (listing
> the cassandra data storage directory triggers an input/output error).
> Whenever this occurs, I see many timeout exceptions in our application on various nodes,
> which cause everything to run very slowly. Key range scans just time out and sometimes
> never succeed. If I stop cassandra on the faulty node, everything runs normally again.
> It would be great to have some kind of monitoring thread in cassandra which marks a node
> as "down" if there are multiple read/write errors to the data directories. A single faulty
> hd on 1 node shouldn't affect global cluster performance.
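
The monitoring idea in the issue description could be sketched roughly as below. This is
only an illustration in Python, not Cassandra code and not the patch discussed here. The
class name DiskErrorMonitor, the threshold of 3 errors, and the 60 second window are all
assumptions made for the example. It counts I/O errors seen while listing a data
directory and flags the node as down once too many occur within the window.

```python
import os
import time
from collections import deque


class DiskErrorMonitor:
    """Hypothetical sketch: flag a node as down after repeated disk I/O errors."""

    def __init__(self, max_errors=3, window_seconds=60.0, clock=time.monotonic):
        self.max_errors = max_errors          # assumed threshold, not from the issue
        self.window_seconds = window_seconds  # assumed sliding window length
        self.clock = clock                    # injectable so the logic can be tested
        self.errors = deque()                 # timestamps of recent I/O errors
        self.down = False

    def record_error(self):
        """Record one I/O error and mark the node down if the threshold is hit."""
        now = self.clock()
        self.errors.append(now)
        # Drop errors that have fallen out of the sliding window.
        while self.errors and now - self.errors[0] > self.window_seconds:
            self.errors.popleft()
        if len(self.errors) >= self.max_errors:
            # A real node would stop serving requests at this point.
            self.down = True

    def probe(self, data_dir):
        """List the data directory; return True on success, False on an OS error.

        Any OSError counts, so a missing directory is treated like a read error.
        """
        try:
            os.listdir(data_dir)
            return True
        except OSError:
            self.record_error()
            return False
```

A monitoring thread would call probe() on each data directory at a fixed interval and
check the down flag afterwards. The same record_error() hook could also be called from
the read and write paths whenever they hit an I/O error.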

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
