cassandra-dev mailing list archives

From Jake Luciani <jak...@gmail.com>
Subject Re: SEVERE Data Corruption Problems
Date Fri, 11 Feb 2011 01:30:00 GMT
Can you show us a listing of the sstable file names? They should be *-f-Data.db
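
For anyone following along, here is a minimal sketch of how the version token could be pulled out of the data file names. The `<ColumnFamily>-[tmp-]<version>-<generation>-Data.db` layout is an assumption about 0.7-era naming, and `parse_sstable_name` and `count_versions` are hypothetical helper names, not Cassandra tools.

```python
import re
from collections import Counter

# Assumed layout: <ColumnFamily>-[tmp-]<version>-<generation>-Data.db
SSTABLE_NAME = re.compile(
    r"^(?P<cf>.+?)-(?P<tmp>tmp-)?(?P<version>[a-z]+)-(?P<generation>\d+)-Data\.db$"
)


def parse_sstable_name(filename):
    """Split a data file name into its parts, or return None if it does not fit."""
    match = SSTABLE_NAME.match(filename)
    if match is None:
        return None
    return {
        "cf": match.group("cf"),
        "temporary": match.group("tmp") is not None,
        "version": match.group("version"),
        "generation": int(match.group("generation")),
    }


def count_versions(filenames):
    """Count data files per version token; names that do not parse are grouped together."""
    counts = Counter()
    for name in filenames:
        info = parse_sstable_name(name)
        counts[info["version"] if info else "unrecognized"] += 1
    return counts
```

It could be fed the names under the data directory mentioned later in the thread, for example `count_versions(p.name for p in pathlib.Path("/var/lib/cassandra/data").rglob("*-Data.db"))`. Anything reported under a token other than `f`, or as unrecognized, would be worth a closer look.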

On Thu, Feb 10, 2011 at 7:18 PM, Dan Hendry <dan.hendry.junk@gmail.com> wrote:

> Upgraded one node to 0.7. It's logging exceptions like mad (thousands per
> minute), all like the one below (which is fairly new to me):
>
> ERROR [ReadStage:721] 2011-02-10 18:13:56,190 AbstractCassandraDaemon.java (line 114) Fatal exception in thread Thread[ReadStage:721,5,main]
> java.io.IOError: java.io.EOFException
>        at org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:75)
>        at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
>        at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
>        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1275)
>        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1167)
>        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1095)
>        at org.apache.cassandra.db.Table.getRow(Table.java:384)
>        at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
>        at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:473)
>        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.EOFException
>        at java.io.DataInputStream.readInt(DataInputStream.java:375)
>        at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:48)
>        at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:30)
>        at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:108)
>        at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:106)
>        at org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:71)
>        ... 12 more
>
> Dan
>
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: February-09-11 18:14
> To: dev
> Subject: Re: SEVERE Data Corruption Problems
>
> Hi Dan,
>
> it would be very useful to test with 0.7 branch instead of 0.7.0 so at
> least you're not chasing known and fixed bugs like CASSANDRA-1992.
>
> As you say, there's a lot of people who aren't seeing this, so it
> would also be useful if you can provide some kind of test harness
> where you can say "point this at a cluster and within a few hours
>
> On Wed, Feb 9, 2011 at 4:31 PM, Dan Hendry <dan.hendry.junk@gmail.com>
> wrote:
> > I have been having SEVERE data corruption issues with SSTables in my
> > cluster; for one CF it was happening almost daily (I have since shut down
> > the service using that CF as it was too much work to manage the Cassandra
> > errors). At this point, I can’t see how it is anything but a Cassandra bug,
> > yet it’s somewhat strange and very scary that I am the only one who seems to
> > be having such serious issues. Most of my data is indexed in two ways, so I
> > have been able to write a validator which goes through and backfills
> > missing data, but it’s kind of defeating the whole point of Cassandra. The
> > only way I have found to deal with issues when they crop up, and to prevent
> > nodes crashing from repeated failed compactions, is to delete the SSTable. My
> > cluster is running a slightly modified 0.7.0 version which logs which files
> > errors occur for so that I can stop the node and delete them.
> >
> >
> >
> > The problem:
> >
> > -          Reads, compactions and hinted handoff fail with various
> > exceptions (samples shown at the end of this email) which seem to indicate
> > sstable corruption.
> >
> > -          I have seen failed reads/compactions/hinted handoff on 4 out of 4
> > nodes (RF=2) for 3 different super column families and 1 standard column
> > family (4 out of 11) and, just now, the Hints system CF. (If it matters,
> > the ring has not changed since one CF which has been giving me trouble was
> > created.) I have checked SMART disk info and run various diagnostics, and
> > there do not seem to be any hardware issues. Plus, what are the chances of
> > all four nodes having the same hardware problems at the same time when, for
> > all other purposes, they appear fine?
> >
> > -          I have added logging which outputs which sstables are causing
> > exceptions to be thrown. The corrupt sstables have been both freshly flushed
> > memtables and the output of compaction (i.e., 4 sstables which all seem to be
> > fine get compacted to 1 which is then corrupt). It seems that the majority
> > of corrupt sstables are post-compaction (vs. post-memtable flush).
> >
> > -          The one CF which was giving me the most problems was heavily
> > written to (1000-1500 writes/second continually across the cluster). For
> > that CF, I was having to delete 4-6 sstables a day across the cluster (and
> > the number was going up; even the number of problems for the remaining CFs
> > is going up). The other CFs which have had corrupt sstables are also quite
> > heavily written to (generally a few hundred writes a second across the
> > cluster).
> >
> > -          Most of the time (5/6 attempts) when this problem occurs,
> > sstable2json also fails. I have, however, had one case where I was able to
> > export the sstable to json, then re-import it, at which point I was no longer
> > seeing exceptions.
> >
> > -          The cluster has been running for a little over 2 months now; the
> > problem seems to have sprung up in the last 3-4 weeks and seems to be
> > steadily getting worse.
> >
> >
> >
> > Ultimately, I think I am hitting some subtle race condition somewhere. I
> > have been starting to dig into the Cassandra code but I barely know where to
> > start looking. I realize I have not provided nearly enough information to
> > easily debug the problem, but PLEASE keep your eyes open for possibly racy or
> > buggy code which could cause these sorts of problems. I am willing to
> > provide full Cassandra logs and a corrupt SSTable on an individual basis:
> > please email me and let me know.
> >
> >
> >
> > Here is some possibly relevant information and my theories on a possible
> > root cause. Again, I know little about the Cassandra code base and have only
> > moderate Java experience, so these theories may be way off base.
> >
> > -          Strictly speaking, I probably don’t have enough memory for my
> > workload. I see stop-the-world GC occurring ~30 times/day/node, often causing
> > Cassandra to hang for 30+ seconds (according to the GC logs). Could there be
> > some Java bug where a full GC in the middle of writing or flushing
> > (compaction/memtable flush) or doing some other disk-based activity causes
> > some sort of data corruption?
> >
> > -          Writes are usually done at ConsistencyLevel ONE with additional
> > client-side retry logic. Given that I often see consecutive nodes in the
> > ring down, could there be some edge condition where dying at just the right
> > time causes parts of mutations/messages to be lost?
> >
> > -          All of the CFs which have been causing me problems have large
> > rows which are compacted incrementally. Could there be some problem with the
> > incremental compaction logic?
> >
> > -          My cluster has a fairly heavy write load (again, the most
> > problematic CF is getting 1500 w/s × RF 2 / 4 nodes = 750 writes/second/node).
> > Furthermore, it is highly probable that there are timestamp collisions.
> > Could there be some issue with timestamp logic (i.e., using > instead of >= or
> > some such) during flushes/compaction?
> >
> > -          Once a node
> >
> >
> >
> > Cluster/system information:
> >
> > -          4 nodes with RF=2
> >
> > -          Nodes have 8 cores with 24 GB of RAM apiece.
> >
> > -          2 HDs, 1 for commit log/system, 1 for /var/lib/cassandra/data
> >
> > -          OS is Ubuntu 10.04 (uname -r = 2.6.32-24-server)
> >
> > -          Java:
> >
> > o   java version "1.6.0_22"
> >
> > o   Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> >
> > o   Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)
> >
> > -          Slightly modified (file information in exceptions) version of
> > 0.7.0
> >
> >
> >
> > The following cassandra.yaml properties have been changed from their defaults:
> >
> > -          commitlog_sync_period_in_ms: 100 (with commitlog_sync: periodic)
> >
> > -          disk_access_mode: mmap_index_only
> >
> > -          concurrent_reads: 12
> >
> > -          concurrent_writes: 2 (was 32, but I dropped it to 2 to try and
> > eliminate any mutation race conditions – did not seem to help)
> >
> > -          sliced_buffer_size_in_kb: 128
> >
> > -          in_memory_compaction_limit_in_mb: 50
> >
> > -          rpc_timeout_in_ms: 15000
> >
> >
> >
> > Schema for most problematic CF:
> >
> > name: DeviceEventsByDevice
> >
> > column_type: Standard
> >
> > memtable_throughput_in_mb: 150
> >
> > memtable_operations_in_millions: 1.5
> >
> > gc_grace_seconds: 172800
> >
> > keys_cached: 1000000
> >
> > rows_cached: 0
> >
> >
> >
> > Dan Hendry
> >
> > (403) 660-2297
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
>
