Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Wed, 4 Mar 2015 00:57:04 +0000 (UTC)
From: "Kishan Karunaratne (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12779250.1425429378000.70666.1425430624460@Atlassian.JIRA>
In-Reply-To: <JIRA.12779250.1425429378000@Atlassian.JIRA>
References: <JIRA.12779250.1425429378000@Atlassian.JIRA>
 <JIRA.12779250.1425429378616@arcas>
Subject: [jira] [Updated] (CASSANDRA-8902) Missing data files, database
 corruption
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/CASSANDRA-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kishan Karunaratne updated CASSANDRA-8902:
------------------------------------------
    Reproduced In: 2.0.12, 2.0.11  (was: 2.0.11, 2.0.12)
      Description: 
During a recent endurance test run of the ruby-driver (and also in a previous run), I saw many occurrences of the following exception in the system.log on the 2nd node (10.240.185.204):
{noformat}
ERROR [CompactionExecutor:81] 2015-02-20 22:32:33,064 CassandraDaemon.java (line 199) Exception in thread Thread[CompactionExecutor:81,1,main]
java.lang.RuntimeException: java.io.FileNotFoundException: /srv/performance/cass/data/duration_test1/ints/duration_test1-ints-jb-39-Data.db (No such file or directory)
        at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52)
        at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1399)
        at org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67)
        at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1205)
        at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1217)
        at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:272)
        at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:278)
        at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:131)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
        at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.FileNotFoundException: /srv/performance/cass/data/duration_test1/ints/duration_test1-ints-jb-39-Data.db (No such file or directory)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
        at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
        at org.apache.cassandra.io.compress.CompressedThrottledReader.<init>(CompressedThrottledReader.java:34)
        at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:48)
        ... 17 more
{noformat}

I've checked this data directory, and this specific db file is indeed missing, which suggests database corruption.
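
The manual check described above could be automated along these lines. This is only a minimal sketch, not part of the original report: the function names are hypothetical, the regular expression assumes the exact "(No such file or directory)" wording shown in the trace above, and it will not match paths that contain spaces.

```python
import os
import re

# Matches the path in log lines such as:
#   Caused by: java.io.FileNotFoundException: /path/x-Data.db (No such file or directory)
# Assumption: the path contains no whitespace.
MISSING_RE = re.compile(
    r"java\.io\.FileNotFoundException: (\S+) \(No such file or directory\)"
)


def missing_sstables(log_text):
    """Return the sorted, de-duplicated file paths the log reports as missing."""
    return sorted(set(MISSING_RE.findall(log_text)))


def still_missing(paths):
    """Return the subset of paths that do not exist on the local disk."""
    return [p for p in paths if not os.path.exists(p)]


def report(log_path):
    """Print every path that the log reports as missing and that is absent on disk."""
    with open(log_path, errors="replace") as fh:
        reported = missing_sstables(fh.read())
    for path in still_missing(reported):
        print(path)
```

Calling `report` with the path to a node's system.log, on that node, would list the files that are named in a FileNotFoundException and are still absent from disk.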

The endurance test runs against a 3-node cluster for 3 days, with a chaos rhino randomly restarting one of the nodes. The nodes also appear to have gone out of sync. For example, running nodetool status against one node gives:

{noformat}
$ cassandra/bin/nodetool -h 10.240.61.210 status

--  Address         Load       Tokens  Owns   Host ID                               Rack
DN  10.240.210.69   533.32 MB  256     32.3%  2947fe5e-f149-4ff6-b26c-570ae72b7606  RAC1
DN  10.240.185.204  570.86 MB  256     36.7%  3a6e2152-c7dc-457a-a4c5-4c6f01986dd0  RAC1
UN  10.240.61.210   877.43 MB  256     31.0%  c3b1beff-9587-4851-85a9-05a9ba6deaff  RAC1
{noformat}

While on either of the other two nodes (10.240.210.69 or 10.240.185.204) it gives:

{noformat}
$ cassandra/bin/nodetool -h 10.240.210.69 status
--  Address         Load       Tokens  Owns   Host ID                               Rack
UN  10.240.210.69   4.83 GB    256     32.3%  2947fe5e-f149-4ff6-b26c-570ae72b7606  RAC1
UN  10.240.185.204  4.88 GB    256     36.7%  3a6e2152-c7dc-457a-a4c5-4c6f01986dd0  RAC1
DN  10.240.61.210   877.43 MB  256     31.0%  c3b1beff-9587-4851-85a9-05a9ba6deaff  RAC1
{noformat}
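
The disagreement between the two views can be made explicit by parsing both outputs and comparing the status codes per address. The following is a rough sketch under stated assumptions: the helper names are hypothetical, and it only recognises rows that begin with a two-letter code (U or D, then N, L, J or M) followed by an address, as in the tables above.

```python
import re

# A status row starts with a two-letter code: U/D (up/down), then N/L/J/M
# (normal/leaving/joining/moving), followed by the node address, e.g.
#   "DN  10.240.210.69   533.32 MB  256  32.3%  ..."
ROW_RE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def parse_status(output):
    """Map each node address to the two-letter status code shown for it."""
    states = {}
    for line in output.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states


def disagreements(view_a, view_b):
    """Return {address: (code_in_a, code_in_b)} for nodes whose codes differ."""
    return {
        addr: (view_a[addr], view_b[addr])
        for addr in sorted(view_a.keys() & view_b.keys())
        if view_a[addr] != view_b[addr]
    }
```

Feeding the two outputs above through `parse_status` and then `disagreements` flags all three addresses, since each view marks as down exactly the nodes that the other view marks as up.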

In addition to a missing db file (albeit a different one), I also see many occurrences of the following exception on the 3rd node (10.240.61.210):
{noformat}
INFO [MeteredFlusher:1] 2015-03-03 21:17:41,032 MeteredFlusher.java (line 86) Estimated 488539743 live and 265675077 flushing bytes used by all memtables
ERROR [MeteredFlusher:1] 2015-03-03 21:17:41,033 CassandraDaemon.java (line 199) Exception in thread Thread[MeteredFlusher:1,5,main]
java.lang.NoClassDefFoundError: org/apache/cassandra/db/MeteredFlusher$1
        at org.apache.cassandra.db.MeteredFlusher.run(MeteredFlusher.java:88)
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:80)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
{noformat}

There are no error messages on the 1st node (10.240.210.69). I've attempted to reboot the cluster, but the nodes remain in the state described above.

  was:
During a recent endurance test run of the ruby-driver (as well as a previous run), I see many of the following exceptions thrown in the system.log in the 2nd node (10.240.185.204):
{noformat}
ERROR [CompactionExecutor:81] 2015-02-20 22:32:33,064 CassandraDaemon.java (line 199) Exception in thread Thread[CompactionExecutor:81,1,main]
java.lang.RuntimeException: java.io.FileNotFoundException: /srv/performance/cass/data/duration_test1/ints/duration_test1-ints-jb-39-Data.db (No such file or directory)
        at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52)
        at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1399)
        at org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67)
        at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1205)
        at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1217)
        at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:272)
        at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:278)
        at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:131)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
        at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.FileNotFoundException: /srv/performance/cass/data/duration_test1/ints/duration_test1-ints-jb-39-Data.db (No such file or directory)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
        at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
        at org.apache.cassandra.io.compress.CompressedThrottledReader.<init>(CompressedThrottledReader.java:34)
        at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:48)
        ... 17 more
{noformat}

I've checked this data directory and indeed this specific db file is missing. This would signal a database corruption.

The endurance test uses a 3-node cluster run over 3 days, with a chaos rhino randomly restarting one of the nodes. It seems like the nodes have also gone out of sync. For example, getting a nodetool status on one node gives:

{noformat}
$ cassandra/bin/nodetool -h 10.240.61.210 status

--  Address         Load       Tokens  Owns   Host ID                               Rack
DN  10.240.210.69   533.32 MB  256     32.3%  2947fe5e-f149-4ff6-b26c-570ae72b7606  RAC1
DN  10.240.185.204  570.86 MB  256     36.7%  3a6e2152-c7dc-457a-a4c5-4c6f01986dd0  RAC1
UN  10.240.61.210   877.43 MB  256     31.0%  c3b1beff-9587-4851-85a9-05a9ba6deaff  RAC1
{noformat}

While on another node it gives:

{noformat}
$ cassandra/bin/nodetool -h 10.240.210.69 status (or 10.240.185.204)
--  Address         Load       Tokens  Owns   Host ID                               Rack
UN  10.240.210.69   4.83 GB    256     32.3%  2947fe5e-f149-4ff6-b26c-570ae72b7606  RAC1
UN  10.240.185.204  4.88 GB    256     36.7%  3a6e2152-c7dc-457a-a4c5-4c6f01986dd0  RAC1
DN  10.240.61.210   877.43 MB  256     31.0%  c3b1beff-9587-4851-85a9-05a9ba6deaff  RAC1
{noformat}

In addition to seeing the missing db file (albeit a different one), I also see many occurrences of the following exception in the 3rd node (10.240.61.210):
{noformat}
INFO [MeteredFlusher:1] 2015-03-03 21:17:41,032 MeteredFlusher.java (line 86) Estimated 488539743 live and 265675077 flushing bytes used by all memtables
ERROR [MeteredFlusher:1] 2015-03-03 21:17:41,033 CassandraDaemon.java (line 199) Exception in thread Thread[MeteredFlusher:1,5,main]
java.lang.NoClassDefFoundError: org/apache/cassandra/db/MeteredFlusher$1
        at org.apache.cassandra.db.MeteredFlusher.run(MeteredFlusher.java:88)
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:80)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
{noformat}

There are no error messages in the 1st node (10.240.210.69).


> Missing data files, database corruption
> ---------------------------------------
>
>                 Key: CASSANDRA-8902
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8902
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: ruby-driver 2.1.0 | C* 2.0.12
>            Reporter: Kishan Karunaratne

