incubator-cassandra-dev mailing list archives

From "Dan Hendry" <dan.hendry.j...@gmail.com>
Subject SEVERE Data Corruption Problems
Date Wed, 09 Feb 2011 22:31:07 GMT
I have been having SEVERE data corruption issues with SSTables in my
cluster; for one CF it was happening almost daily (I have since shut down
the service using that CF, as it was too much work to manage the Cassandra
errors). At this point I can't see how it is anything but a Cassandra bug,
yet it's somewhat strange and very scary that I seem to be the only one
having such serious issues. Most of my data is indexed in two ways, so I
have been able to write a validator which goes through and back-fills
missing data, but that rather defeats the whole point of Cassandra. The only
way I have found to keep nodes from crashing under repeated failed
compactions when these issues crop up is to delete the offending SSTable. My
cluster is running a slightly modified 0.7.0 build which logs which files
the errors occur in, so that I can stop the node and delete them (rough
cleanup procedure below).

 

The problem: 

-          Reads, compactions and hinted handoff fail with various
exceptions (samples shown at the end of this email) which seem to indicate
sstable corruption.

-          I have seen failed reads/compactions/hinted handoff on 4 out of 4
nodes (RF=2) for 3 different super column families and 1 standard column
family (4 out of 11), and just now the Hints system CF. (If it matters, the
ring has not changed since one of the CFs which has been giving me trouble
was created.) I have checked SMART disk info and run various diagnostics and
there do not seem to be any hardware issues; besides, what are the chances
of all four nodes having the same hardware problem at the same time when
they appear fine for all other purposes?

-          I have added logging which outputs which sstables are causing
exceptions to be thrown. The corrupt sstables have been both freshly flushed
memtables and the output of compaction (i.e. 4 sstables which all seem to be
fine get compacted into 1 which is then corrupt). The majority of corrupt
sstables seem to be post-compaction (vs post-memtable flush).

-          The CF which was giving me the most problems was heavily written
to (1000-1500 writes/second continually across the cluster). For that CF I
was having to delete 4-6 sstables a day across the cluster, and the number
was going up; even the number of problems for the remaining CFs is going up.
The other CFs which have had corrupt sstables are also quite heavily written
to (generally a few hundred writes a second across the cluster).

-          Most of the time (5 of 6 attempts) when this problem occurs,
sstable2json also fails. I have, however, had one case where I was able to
export the sstable to json and then re-import it, at which point I was no
longer seeing exceptions (example commands after this list).

-          The cluster has been running for a little over 2 months now; the
problem seems to have sprung up in the last 3-4 weeks and to be steadily
getting worse.
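
For the record, that one successful round trip was essentially the following
(keyspace/CF names and paths are illustrative; both tools ship in bin/ with
0.7, so check their usage output for the exact arguments):

# Dump the suspect sstable to json -- this is the step that usually fails.
bin/sstable2json /var/lib/cassandra/data/MyKeyspace/DeviceEventsByDevice-e-1234-Data.db > dump.json
# With the node stopped, rebuild a fresh sstable from the json under a new
# generation number, then restart the node.
bin/json2sstable -K MyKeyspace -c DeviceEventsByDevice dump.json /var/lib/cassandra/data/MyKeyspace/DeviceEventsByDevice-e-2000-Data.db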

 

Ultimately, I think I am hitting some subtle race condition somewhere. I
have started to dig into the Cassandra code but I barely know where to start
looking. I realize I have not provided nearly enough information to easily
debug the problem, but PLEASE keep your eyes open for possibly racy or buggy
code which could cause these sorts of problems. I am willing to provide full
Cassandra logs and a corrupt SSTable on an individual basis: please email me
and let me know.

 

Here is some possibly relevant information, along with my theories on a
possible root cause. Again, I know little about the Cassandra code base and
have only moderate Java experience, so these theories may be way off base.

-          Strictly speaking, I probably don't have enough memory for my
workload. I see stop-the-world GC occurring ~30 times/day/node, often
causing Cassandra to hang for 30+ seconds (according to the GC logs; the
logging setup is sketched after this list). Could there be some Java bug
where a full GC in the middle of writing or flushing (compaction/memtable
flush) or some other disk-based activity causes some sort of data
corruption?

-          Writes are usually done at ConsistencyLevel ONE with additional
client-side retry logic. Given that I often see consecutive nodes in the
ring down, could there be some edge condition where a node dying at just the
right time causes parts of mutations/messages to be lost?

-          All of the CFs which have been causing me problems have large
rows which are compacted incrementally. Could there be some problem with the
incremental compaction logic?

-          My cluster has a fairly heavy write load (again, the most
problematic CF sees ~1500 writes/second across the cluster, which at RF=2
over 4 nodes works out to ~750 writes/second/node). Furthermore, it is
highly probable that there are timestamp collisions. Could there be some
issue with the timestamp logic (i.e. using > instead of >= or some such)
during flushes/compaction?

-          Once a node 
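
Regarding the GC pauses mentioned above: the GC logs come from the standard
HotSpot logging options, set via cassandra-env.sh along these lines (exact
flag set quoted from memory, so treat it as approximate):

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
# The 30+ second hangs show up as "Total time for which application threads
# were stopped: ..." lines in the resulting log.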

 

Cluster/system information:

-          4 nodes with RF=2

-          Nodes have 8 cores and 24 GB of RAM apiece.

-          2 HDs, 1 for commit log/system, 1 for /var/lib/cassandra/data

-          OS is Ubuntu 10.04 (uname -r = 2.6.32-24-server)

-          Java:

o   java version "1.6.0_22"

o   Java(TM) SE Runtime Environment (build 1.6.0_22-b04)

o   Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)

-          Slightly modified version of 0.7.0 (adds the offending file's
name to exception messages)

 

The following cassandra.yaml properties have been changed from their
defaults:

-          commitlog_sync_period_in_ms: 100 (with commitlog_sync: periodic)

-          disk_access_mode: mmap_index_only

-          concurrent_reads: 12

-          concurrent_writes: 2 (was 32, but I dropped it to 2 to try and
eliminate any mutation race conditions - did not seem to help)

-          sliced_buffer_size_in_kb: 128

-          in_memory_compaction_limit_in_mb: 50

-          rpc_timeout_in_ms: 15000

 

Schema for most problematic CF:

name: DeviceEventsByDevice

column_type: Standard

memtable_throughput_in_mb: 150

memtable_operations_in_millions: 1.5

gc_grace_seconds: 172800

keys_cached: 1000000

rows_cached: 0

 

Dan Hendry

(403) 660-2297

 

