cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ariel Weisberg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10534) CompressionInfo not being fsynced on close
Date Fri, 13 Nov 2015 14:43:11 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004049#comment-15004049
] 

Ariel Weisberg commented on CASSANDRA-10534:
--------------------------------------------

Yes please open new low priority ticket. It can't be a ton of work to sync the TOC and digest
and it would reduce the window where they are inconsistent.

A bigger issue brought up by this is that we don't do power failure testing in a loop so we
can't actually verify the correctness of Cassandra under those conditions. Can you create
a major priority ticket for that?

> CompressionInfo not being fsynced on close
> ------------------------------------------
>
>                 Key: CASSANDRA-10534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10534
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sharvanath Pathak
>            Assignee: Stefania
>             Fix For: 2.1.x
>
>
> I was seeing SSTable corruption due to a CompressionInfo.db file of size 0, this happened
multiple times in our testing with hard node reboots. After some investigation it seems like
these file is not being fsynced, and that can potentially lead to data corruption. I am working
with version 2.1.9.
> I checked for fsync calls using strace, and found them happening for all but the following
components: CompressionInfo, TOC.txt and digest.sha1. All of these but the CompressionInfo
seem tolerable. Also a quick look through the code did not reveal any fsync calls. Moreover,
I suspect the commit  4e95953f29d89a441dfe06d3f0393ed7dd8586df (https://github.com/apache/cassandra/commit/4e95953f29d89a441dfe06d3f0393ed7dd8586df#diff-b7e48a1398e39a936c11d0397d5d1966R344)
has caused the regression, which removed the line
> {noformat}
>  getChannel().force(true);
> {noformat}
> from CompressionMetadata.Writer.close.
> Following is the trace I saw in system.log:
> {noformat}
> INFO  [SSTableBatchOpen:1] 2015-09-29 19:24:39,170 SSTableReader.java:478 - Opening /var/lib/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-ka-13368
(79 bytes)
> ERROR [SSTableBatchOpen:1] 2015-09-29 19:24:39,177 FileUtils.java:447 - Exiting forcefully
due to file system exception on startup, disk failure policy "stop"
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
>         at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:131)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:85)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.io.util.CompressedSegmentedFile$Builder.metadata(CompressedSegmentedFile.java:79)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:72)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:168)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:752)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:703)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:491)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:387)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.io.sstable.SSTableReader$4.run(SSTableReader.java:534)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_80]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_80]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_80]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_80]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
> Caused by: java.io.EOFException: null
>         at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340) ~[na:1.7.0_80]
>         at java.io.DataInputStream.readUTF(DataInputStream.java:589) ~[na:1.7.0_80]
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564) ~[na:1.7.0_80]
>         at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:106)
~[apache-cassandra-2.1.9.jar:2.1.9]
>         ... 14 common frames omitted
> {noformat}
> Following is the result of ls on the data directory of a corrupted SSTable after the
hard reboot:
> {noformat}
> $ ls -l /var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/
> total 60
> -rw-r--r-- 1 cassandra cassandra     0 Oct 15 09:31 system-sstable_activity-ka-1-CompressionInfo.db
> -rw-r--r-- 1 cassandra cassandra  9740 Oct 15 09:31 system-sstable_activity-ka-1-Data.db
> -rw-r--r-- 1 cassandra cassandra     0 Oct 15 09:31 system-sstable_activity-ka-1-Digest.sha1
> -rw-r--r-- 1 cassandra cassandra   880 Oct 15 09:31 system-sstable_activity-ka-1-Filter.db
> -rw-r--r-- 1 cassandra cassandra 34000 Oct 15 09:31 system-sstable_activity-ka-1-Index.db
> -rw-r--r-- 1 cassandra cassandra  7338 Oct 15 09:31 system-sstable_activity-ka-1-Statistics.db
> -rw-r--r-- 1 cassandra cassandra     0 Oct 15 09:31 system-sstable_activity-ka-1-TOC.txt
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message