incubator-cassandra-user mailing list archives

From Ranking Lekarzy <rankinglekarzy....@gmail.com>
Subject DB file corruption
Date Thu, 20 Mar 2014 14:16:49 GMT
Hello,

I have a cluster running Cassandra 1.2.12. On one node I'm getting
exceptions about corruption detected in one of the DB files. The exceptions
occurred when I tried to run the nodetool upgradesstables command, and
after the exception upgradesstables could not continue.
I then decided to run nodetool scrub on the corrupted column family. It failed
with the same exception. Next I tried offline scrubbing with the
sstablescrub utility. It complained about the same corrupted file, although
I had expected it to throw away the corrupted rows and rebuild the
file. Below are some log messages from sstablescrub:

Scrubbing
SSTableReader(path='/var/lib/cassandra/data/Foo/Bar/Foo-Bar-hf-6247-Data.db')
WARNING: Non-fatal error reading row (stacktrace follows)
WARNING: Row at 83742783839 is unreadable; skipping to next
Error scrubbing
SSTableReader(path='/var/lib/cassandra/data/Foo/Bar/Foo-Bar-hf-6247-Data.db'):
org.apache.cassandra.io.compress.CorruptBlockException:
(/var/lib/cassandra/data/Foo/Bar/Foo-Bar-hf-6247-Data.db): corruption
detected, chunk at 71488440767 of length 55562.


After sstablescrub had finished, the corrupted file did not appear to have
been modified. All files had been recreated, except for the corrupted one.
I tried the upgrade anyway, this time in offline mode with the
sstableupgrade utility.
It failed with a similar exception:

Found 1 sstables that need upgrading.
Upgrading
SSTableReader(path='/var/lib/cassandra/data/Foo/Bar/Foo-Bar-hf-6247-Data.db')
Error upgrading
SSTableReader(path='/var/lib/cassandra/data/Foo/Bar/Foo-Bar-hf-6247-Data.db'):
org.apache.cassandra.io.compress.CorruptBlockException:
(/var/lib/cassandra/data/Foo/Bar/Foo-Bar-hf-6247-Data.db): corruption
detected, chunk at 71488440767 of length 55562.


The cluster has RF = 3, so it should be safe to just remove this file and run
repair. But I would prefer to fix this file instead of removing it. Is that
possible?
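
For the remove-and-repair route, a minimal sketch in Python of setting the
damaged SSTable aside instead of deleting it outright. It assumes the node is
stopped and that every component of the SSTable shares the
`Foo-Bar-hf-6247-` filename prefix seen in the logs above; the function name
and the backup directory are hypothetical.

```python
import shutil
from pathlib import Path


def quarantine_sstable(data_dir, prefix, backup_dir):
    """Move every file in data_dir whose name starts with prefix into backup_dir.

    Returns the sorted list of file names that were moved.
    """
    src = Path(data_dir)
    dst = Path(backup_dir)
    # Create the backup directory (hypothetical location) if it is missing.
    dst.mkdir(parents=True, exist_ok=True)

    moved = []
    for component in sorted(src.iterdir()):
        # Only touch regular files belonging to the one damaged SSTable,
        # e.g. Foo-Bar-hf-6247-Data.db and its sibling component files.
        if component.is_file() and component.name.startswith(prefix):
            shutil.move(str(component), str(dst / component.name))
            moved.append(component.name)
    return moved
```

It could be called as `quarantine_sstable("/var/lib/cassandra/data/Foo/Bar",
"Foo-Bar-hf-6247-", backup_dir)` with a backup path of your choosing. After
the node is restarted, `nodetool repair Foo Bar` should stream the missing
data back from the other replicas, provided they hold consistent copies.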

I've compared this CF on each node with cfstats, and on the node with the
corrupted data the CF is double the size it is on the other nodes.
The corruption has been there for a long time, so all repairs have been
failing with this exception. I'm a bit worried about whether this CF is
properly replicated.

Do you have any suggestions on how to safely recover this CF?

Thank you,
Michal
