cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-1717) Cassandra cannot detect corrupt-but-readable column data
Date Fri, 05 Aug 2011 10:57:27 GMT


Sylvain Lebresne commented on CASSANDRA-1717:

My 2 cents:

I see 3 options that seem to make sense:
# checksums at the column level:
  ** pros: easy to do, and recovery from bitrot is both easy and efficient (efficient in that
we would generally only need to drop one column for a given bitrot; it's more complicated
if something in the row header (row key, row size, ...) is bitrotten, though).
  ** cons: high overhead (mainly in disk space, but also in CPU, because we have
many more checksums to check).
# checksums at the row level (or column index level, but I think this is essentially the same,
isn't it?):
  ** pros: easy to recover from bitrot (we drop the row), though potentially more wasteful
than "column level". Incurs only a small space overhead for big rows.
  ** cons: can't realistically check on every read, so we would only check on compaction/repair
and on read digest mismatch (the latter is mandatory if we want checksums to guarantee
that bitrot never propagates to other nodes); this adds complexity, plus some I/O to check
checksums on read digest mismatch that is usually unnecessary (a read digest mismatch won't in
general be due to bitrot). Also incurs a significant space overhead for tiny rows.
# checksums at the block level:
  ** pros: super easy in the compressed case (can be done "on every read", or more precisely
each time we read a block). Incurs minimal overhead.
  ** cons: super *not* easy in the non-compressed case. We don't have blocks in the uncompressed
case. While writing, we could use the buffer size as a block size and add a checksum on flush.
The problems are on reads, however. First, as Pavel said, we would need to align buffers on
reads (which we don't do in the non-compressed case), which likely involves more reBuffer
in general (i.e. more I/O). But perhaps more importantly, I have no clue how you could make
that work with mmap efficiently (we would potentially have a checksum in the middle of a column
value as far as mmap is concerned).  Also slightly harder to recover from bitrot without dropping
the whole sstable (but doable as long as we have the index around).
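
To make the "block level" option concrete, here is a minimal Python sketch (not Cassandra
code). It assumes a fixed BLOCK_SIZE standing in for the write buffer size and a CRC32 appended
to each block on flush; both names and values are hypothetical.

```python
import zlib

BLOCK_SIZE = 16  # hypothetical; stands in for the write buffer size
CRC_SIZE = 4     # bytes used by one CRC32 checksum


def write_blocks(data: bytes) -> bytes:
    """Split data into fixed-size blocks and append a CRC32 to each one."""
    out = bytearray()
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]
        out += block
        out += zlib.crc32(block).to_bytes(CRC_SIZE, "big")
    return bytes(out)


def read_blocks(stored: bytes) -> bytes:
    """Verify every block's checksum and return the original data."""
    out = bytearray()
    stride = BLOCK_SIZE + CRC_SIZE
    for index, start in enumerate(range(0, len(stored), stride)):
        chunk = stored[start:start + stride]
        block, crc = chunk[:-CRC_SIZE], chunk[-CRC_SIZE:]
        if zlib.crc32(block).to_bytes(CRC_SIZE, "big") != crc:
            raise ValueError(f"checksum mismatch in block {index}")
        out += block
    return bytes(out)
```

Note that any value longer than what remains of its block ends up with a checksum in the middle
of its stored bytes. A reader that always goes through read_blocks never sees this, but a reader
that maps the file directly does, which is the mmap difficulty described above.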

There may be other solutions I don't see, and there may be some pros/cons for the ones above
that I have missed (please feel free to complete).

But based on those, my personal opinion is that "column level" has too big an overhead and
"block level" is really problematic in the mmap non-compressed case (but sounds like the best
option to me if we ignore mmap).
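
As a rough illustration of the space overheads being compared, the following Python sketch
computes checksum bytes as a fraction of the data they protect. All sizes are assumptions
chosen for the example (4-byte CRC32, 20-byte columns, 64 KiB blocks), and row headers and
other on-disk metadata are ignored.

```python
CRC_SIZE = 4             # bytes per checksum, assuming CRC32
COLUMN_BYTES = 20        # hypothetical average serialized column size
BLOCK_BYTES = 64 * 1024  # hypothetical block size


def space_overhead(data_bytes: int, checksum_count: int) -> float:
    """Checksum bytes as a fraction of the data bytes they protect."""
    return checksum_count * CRC_SIZE / data_bytes


def compare(columns_per_row: int) -> dict:
    """Overhead of each checksum granularity for rows of the given width."""
    row_bytes = columns_per_row * COLUMN_BYTES
    return {
        "column": space_overhead(row_bytes, columns_per_row),
        "row": space_overhead(row_bytes, 1),
        "block": space_overhead(BLOCK_BYTES, 1),
    }
```

Under these assumptions, compare(1) gives 20% for both "column" and "row" (a tiny row pays as
much as column-level checksums), while compare(1000) keeps "column" at 20% but drops "row" to
0.02%. The "block" figure stays near 0.006% regardless of row shape.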

So my personal preference leans towards using "block level" but only having checksums in the
compressed case and maybe in an uncompressed mode for which mmap would be deactivated.

If we really don't want to consider that, "row level" checksums would maybe be the lesser
evil. But I'm not fond of the overhead for tiny rows, and the 'check checksums on read
digest mismatch' step, while I believe it is necessary in that case, doesn't sound like the best idea.
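
The 'check checksums on read digest mismatch' step could look roughly like the Python sketch
below. It is an illustration under stated assumptions, not Cassandra's read path: each replica
response is modelled as a hypothetical (row_bytes, stored_crc) pair, MD5 stands in for the read
digest, and the function name is made up for the example.

```python
import hashlib
import zlib


def rows_safe_to_reconcile(replicas):
    """Return the replica rows that may take part in read repair.

    replicas is a list of (row_bytes, stored_crc) pairs (hypothetical shape).
    """
    digests = {hashlib.md5(row).digest() for row, _ in replicas}
    if len(digests) == 1:
        # Digests agree: the common case, no checksum verification needed.
        return [row for row, _ in replicas]
    # Digest mismatch: verify row checksums so that a bitrotten copy is
    # dropped instead of being propagated to the other nodes.
    return [row for row, crc in replicas if zlib.crc32(row) == crc]
```

The extra verification only runs on mismatch, which is why its cost is mostly wasted: most
mismatches come from replicas that are merely out of date, and those rows pass the check.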

> Cassandra cannot detect corrupt-but-readable column data
> --------------------------------------------------------
>                 Key: CASSANDRA-1717
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Pavel Yaskevich
>             Fix For: 1.0
>         Attachments: checksums.txt
> Most corruptions of on-disk data due to bitrot render the column (or row) unreadable,
> so the data can be replaced by read repair or anti-entropy.  But if the corruption keeps column
> data readable we do not detect it, and if it corrupts the timestamp to a higher value, the
> column can even resist being overwritten by newer values.
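
A small Python sketch of why a corrupted timestamp is so harmful. It assumes a simplified
last-write-wins rule (the column with the higher timestamp wins; ties go to the first argument,
which is a simplification for this example). The Column type and reconcile function are
hypothetical names, not Cassandra classes.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: bytes
    timestamp: int


def reconcile(a: Column, b: Column) -> Column:
    """Simplified last-write-wins: keep the column with the higher timestamp."""
    return a if a.timestamp >= b.timestamp else b


def flip_timestamp_bit(column: Column, bit: int) -> Column:
    """Simulate bitrot that leaves the column readable but alters its timestamp."""
    return column._replace(timestamp=column.timestamp ^ (1 << bit))
```

For example, flipping bit 40 of a column written at timestamp 1000 yields a timestamp above
2**40, so reconcile keeps the stale value over any genuine write with a smaller timestamp.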

This message is automatically generated by JIRA.