In my opinion the #1 risk of corruption is user/client error with timestamps.  Over time, Cassandra flushes data from memory to disk, and once data has been flushed, Cassandra doesn't go back to modify or delete it in place.  Because of this, deletes are performed by writing a "tombstone" to disk, and on read the cell with the highest client-supplied timestamp wins.

This can lead to apparent corruption if you change the timestamp units your clients produce after data has been inserted.  For example, if you were originally using microseconds for timestamps, you may have inserted a record with a timestamp of 1234567000000.  If you then switched your Cassandra clients to use seconds and attempted to delete that record 1 second later, the tombstone would be written at timestamp 1234568, and since 1234568 < 1234567000000 the record would not be deleted.  Microseconds since the Unix epoch is the de facto standard recommended for client timestamps, but it's important to stay consistent if you switch clients or start using a client in a different language.
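To make the failure mode concrete, here is a minimal sketch in plain Java (no Cassandra client API; the class name and values are hypothetical, mirroring the example above) of the last-write-wins comparison Cassandra applies when reconciling a tombstone against an existing cell:

    // Sketch only: Cassandra reconciles conflicting writes to a cell by
    // comparing client-supplied timestamps -- the highest timestamp wins.
    public class TimestampUnitMismatch {
        public static void main(String[] args) {
            long insertMicros = 1234567000000L; // original insert, in microseconds
            long deleteSeconds = 1234568L;      // tombstone 1 second later, in seconds

            // Last-write-wins: the tombstone only takes effect if its
            // timestamp is greater than the insert's timestamp.
            if (deleteSeconds > insertMicros) {
                System.out.println("Tombstone wins: record is deleted.");
            } else {
                System.out.println("Tombstone loses: record survives ("
                        + deleteSeconds + " < " + insertMicros + ")");
            }
        }
    }

Running this takes the "Tombstone loses" branch, which is exactly what you'd see in production: reads keep returning the old record even though the client believes the delete succeeded.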

More discussion on timestamps: http://comments.gmane.org/gmane.comp.db.cassandra.devel/1165

-Ben


On Tue, Jun 8, 2010 at 10:45 PM, Hector Urroz <hector@magpieti.com> wrote:
Hi all,

We're starting to prototype Cassandra for use in a production system and became concerned about data corruption after reading the excellent article where Evan Weaver writes:

"Cassandra is an alpha product and could, theoretically, lose your data. In particular, if you change the schema specified in the storage-conf.xml file, you must follow these instructions carefully, or corruption will occur (this is going to be fixed). Also, the on-disk storage format is subject to change, making upgrading a bit difficult."

Is database corruption a well-known or common problem with Cassandra? What sources of information would you recommend to help devise a strategy to minimize corruption risk, and to detect and recover when corruption does occur?

Thanks,

Hector Urroz