Date: Thu, 12 May 2011 07:54:47 +0000 (UTC)
From: "Thibaut (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Message-ID: <2038357148.6187.1305186887570.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <691598271.15623.1301261045754.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (CASSANDRA-2394) Faulty hd kills cluster performance

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032294#comment-13032294 ]

Thibaut commented on CASSANDRA-2394:
------------------------------------

Another hd died. This time, there were ERRORS in the log:

ERROR [ReadStage:336] 2011-05-11 14:35:53,232 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[ReadStage:336,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable
        at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:60)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: corrupt sstable
        at org.apache.cassandra.io.sstable.SSTableScanner.seekTo(SSTableScanner.java:104)
        at org.apache.cassandra.db.RowIteratorFactory.getIterator(RowIteratorFactory.java:96)
        at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1447)
        at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:49)
        ... 4 more
Caused by: java.io.IOException: Input/output error
        at java.io.RandomAccessFile.readBytes(Native Method)
        at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
        at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:206)
        at org.apache.cassandra.io.util.BufferedRandomAccessFile.seek(BufferedRandomAccessFile.java:347)
        at org.apache.cassandra.io.sstable.SSTableScanner.seekTo(SSTableScanner.java:99)
        ... 7 more

Together with:

 WARN [ScheduledTasks:1] 2011-05-11 12:24:35,725 MessagingService.java (line 504) Dropped 10 READ messages in the last 5000ms
 WARN [ScheduledTasks:1] 2011-05-11 12:24:35,725 MessagingService.java (line 504) Dropped 17 RANGE_SLICE messages in the last 5000ms
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 51) Pool Name                    Active   Pending
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 66) ReadStage                        16      1310
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 66) RequestResponseStage              0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 66) ReadRepairStage                   0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,726 StatusLogger.java (line 66) MutationStage                     0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) GossipStage                       0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) AntiEntropyStage                  0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) MigrationStage                    0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) StreamStage                       0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) MemtablePostFlusher               0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,727 StatusLogger.java (line 66) FILEUTILS-DELETE-POOL             0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) FlushWriter                       0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) MiscStage                         0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) FlushSorter                       0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) InternalResponseStage             0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 66) HintedHandoff                     0         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,728 StatusLogger.java (line 70) CompactionManager               n/a         0
 INFO [ScheduledTasks:1] 2011-05-11 12:24:35,729 StatusLogger.java (line 82) MessagingService                n/a       0,0

> Faulty hd kills cluster performance
> -----------------------------------
>
>                 Key: CASSANDRA-2394
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2394
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Thibaut
>            Priority: Minor
>             Fix For: 0.7.6
>
>
> Hi,
> About every week, a node from our main cluster (>100 nodes) has a faulty hd (listing the cassandra data storage directory triggers an input/output error).
> Whenever this occurs, I see many TimeoutExceptions in our application on various nodes, which cause everything to run very slowly. Keyrange scans just time out and will sometimes never succeed. If I stop cassandra on the faulty node, everything runs normally again.
> It would be great to have some kind of monitoring thread in cassandra which marks a node as "down" if there are multiple read/write errors to the data directories. A single faulty hd on 1 node shouldn't affect global cluster performance.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
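
The monitoring thread requested in the issue description could be sketched roughly as follows. This is only an illustration in Java, not Cassandra code: the class DiskErrorMonitor, its methods, and the threshold and window parameters are hypothetical names invented for this example. It counts read/write errors inside a sliding time window and reports when the count reaches a threshold, at which point a node could plausibly take itself out of service.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch, not part of Cassandra: tracks recent I/O errors and
 * says when a node has seen enough of them to be considered unhealthy.
 */
final class DiskErrorMonitor {
    private final int maxErrors;      // errors needed inside the window to trip
    private final long windowMillis;  // length of the sliding window
    private final Deque<Long> errorTimes = new ArrayDeque<>();

    DiskErrorMonitor(int maxErrors, long windowMillis) {
        if (maxErrors < 1 || windowMillis < 1) {
            throw new IllegalArgumentException("maxErrors and windowMillis must be positive");
        }
        this.maxErrors = maxErrors;
        this.windowMillis = windowMillis;
    }

    /** Record one read/write error observed at the given time (milliseconds). */
    synchronized void recordError(long nowMillis) {
        errorTimes.addLast(nowMillis);
        evictExpired(nowMillis);
    }

    /** True if at least maxErrors errors fall inside the window ending at nowMillis. */
    synchronized boolean shouldMarkDown(long nowMillis) {
        evictExpired(nowMillis);
        return errorTimes.size() >= maxErrors;
    }

    /** Drop errors that are windowMillis old or older. */
    private void evictExpired(long nowMillis) {
        while (!errorTimes.isEmpty() && nowMillis - errorTimes.peekFirst() >= windowMillis) {
            errorTimes.removeFirst();
        }
    }

    public static void main(String[] args) {
        // Assumed policy for the demo: 3 errors within 60 seconds.
        DiskErrorMonitor monitor = new DiskErrorMonitor(3, 60_000);
        monitor.recordError(0);
        monitor.recordError(1_000);
        System.out.println(monitor.shouldMarkDown(1_000));   // two errors, below threshold
        monitor.recordError(2_000);
        System.out.println(monitor.shouldMarkDown(2_000));   // three errors inside the window
        System.out.println(monitor.shouldMarkDown(120_000)); // all errors have aged out
    }
}
```

Timestamps are passed in explicitly so the logic can be exercised without waiting. A real monitor would presumably call recordError wherever an IOException from the data directories is caught, and would need a separate mechanism (not shown) to stop the node from serving requests.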