Date: Wed, 4 Mar 2015 00:57:04 +0000 (UTC)
From: "Kishan Karunaratne (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-8902) Missing data files, database corruption

     [ https://issues.apache.org/jira/browse/CASSANDRA-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kishan Karunaratne updated CASSANDRA-8902:
------------------------------------------
    Reproduced In: 2.0.12, 2.0.11  (was: 2.0.11, 2.0.12)
      Description:
During a recent endurance test run of the ruby-driver (as well as a previous run), I see many of the following exceptions thrown in the system.log in the 2nd node (10.240.185.204):

{noformat}
ERROR [CompactionExecutor:81] 2015-02-20 22:32:33,064 CassandraDaemon.java (line 199) Exception in thread Thread[CompactionExecutor:81,1,main]
java.lang.RuntimeException: java.io.FileNotFoundException: /srv/performance/cass/data/duration_test1/ints/duration_test1-ints-jb-39-Data.db (No such file or directory)
    at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52)
    at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1399)
    at org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67)
    at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1205)
    at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1217)
    at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:272)
    at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:278)
    at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:131)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
    at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.FileNotFoundException: /srv/performance/cass/data/duration_test1/ints/duration_test1-ints-jb-39-Data.db (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
    at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
    at org.apache.cassandra.io.compress.CompressedThrottledReader.<init>(CompressedThrottledReader.java:34)
    at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:48)
    ... 17 more
{noformat}

I've checked this data directory and indeed this specific db file is missing. This would signal a database corruption.

The endurance test uses a 3-node cluster run over 3 days, with a chaos rhino randomly restarting one of the nodes. It seems like the nodes have also gone out of sync. For example, getting a nodetool status on one node gives:

{noformat}
$ cassandra/bin/nodetool -h 10.240.61.210 status
--  Address         Load       Tokens  Owns   Host ID                               Rack
DN  10.240.210.69   533.32 MB  256     32.3%  2947fe5e-f149-4ff6-b26c-570ae72b7606  RAC1
DN  10.240.185.204  570.86 MB  256     36.7%  3a6e2152-c7dc-457a-a4c5-4c6f01986dd0  RAC1
UN  10.240.61.210   877.43 MB  256     31.0%  c3b1beff-9587-4851-85a9-05a9ba6deaff  RAC1
{noformat}

While on another node it gives:

{noformat}
$ cassandra/bin/nodetool -h 10.240.210.69 status (or 10.240.185.204)
--  Address         Load       Tokens  Owns   Host ID                               Rack
UN  10.240.210.69   4.83 GB    256     32.3%  2947fe5e-f149-4ff6-b26c-570ae72b7606  RAC1
UN  10.240.185.204  4.88 GB    256     36.7%  3a6e2152-c7dc-457a-a4c5-4c6f01986dd0  RAC1
DN  10.240.61.210   877.43 MB  256     31.0%  c3b1beff-9587-4851-85a9-05a9ba6deaff  RAC1
{noformat}

In addition to seeing the missing db file (albeit a different one), I also see many occurrences of the following exception in the 3rd node (10.240.61.210):

{noformat}
 INFO [MeteredFlusher:1] 2015-03-03 21:17:41,032 MeteredFlusher.java (line 86) Estimated 488539743 live and 265675077 flushing bytes used by all memtables
ERROR [MeteredFlusher:1] 2015-03-03 21:17:41,033 CassandraDaemon.java (line 199) Exception in thread Thread[MeteredFlusher:1,5,main]
java.lang.NoClassDefFoundError: org/apache/cassandra/db/MeteredFlusher$1
    at org.apache.cassandra.db.MeteredFlusher.run(MeteredFlusher.java:88)
    at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:80)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
{noformat}

There are no error messages in the 1st node (10.240.210.69). I've attempted a reboot of the cluster, but the nodes remain in the above described state.
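
The disagreement between the two nodetool status views quoted above can be checked mechanically. The following is a minimal Python sketch and not part of the original report: the helper names `parse_status` and `disagreements` are hypothetical, and it assumes every node row starts with a two-letter status code (U or D, followed by N, L, J or M) and then the node address, as in the output shown in this issue.

```python
import re

# A node row looks like "UN  10.240.210.69  4.83 GB  256 ...".
# First letter: U(p) or D(own); second letter: N(ormal), L(eaving), J(oining), M(oving).
# Header and legend lines do not match this pattern and are skipped.
ROW = re.compile(r"^\s*([UD])([NLJM])\s+(\S+)")


def parse_status(output):
    """Return {address: 'U' or 'D'} for every node row in one nodetool status output."""
    view = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            up_down, _state, address = match.groups()
            view[address] = up_down
    return view


def disagreements(views):
    """Given {observer: {address: 'U'/'D'}}, return the addresses observers disagree on."""
    addresses = set()
    for view in views.values():
        addresses.update(view)
    result = {}
    for address in sorted(addresses):
        opinions = {observer: view.get(address) for observer, view in views.items()}
        if len(set(opinions.values())) > 1:
            result[address] = opinions
    return result


# Outputs copied from the description above.
FROM_61_210 = """\
-- Address Load Tokens Owns Host ID Rack
DN 10.240.210.69 533.32 MB 256 32.3% 2947fe5e-f149-4ff6-b26c-570ae72b7606 RAC1
DN 10.240.185.204 570.86 MB 256 36.7% 3a6e2152-c7dc-457a-a4c5-4c6f01986dd0 RAC1
UN 10.240.61.210 877.43 MB 256 31.0% c3b1beff-9587-4851-85a9-05a9ba6deaff RAC1
"""

FROM_210_69 = """\
-- Address Load Tokens Owns Host ID Rack
UN 10.240.210.69 4.83 GB 256 32.3% 2947fe5e-f149-4ff6-b26c-570ae72b7606 RAC1
UN 10.240.185.204 4.88 GB 256 36.7% 3a6e2152-c7dc-457a-a4c5-4c6f01986dd0 RAC1
DN 10.240.61.210 877.43 MB 256 31.0% c3b1beff-9587-4851-85a9-05a9ba6deaff RAC1
"""

views = {
    "10.240.61.210": parse_status(FROM_61_210),
    "10.240.210.69": parse_status(FROM_210_69),
}

# Print each address whose up/down state differs between the two observers.
for address, opinions in disagreements(views).items():
    print(address, opinions)
```

With the two outputs above, every one of the three addresses is reported, since each observer sees itself as up and the other side as down. The sketch compares only the up/down letter; the Load column, which also differs between the views, is ignored.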

  was: the same description without its final sentence ("I've attempted a reboot of the cluster, but the nodes remain in the above described state.").

> Missing data files, database corruption
> ---------------------------------------
>
>                 Key: CASSANDRA-8902
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8902
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: ruby-driver 2.1.0 | C* 2.0.12
>            Reporter: Kishan Karunaratne

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)