cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Omid Aladini (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4321) stackoverflow building interval tree & possible sstable corruptions
Date Tue, 19 Jun 2012 14:29:44 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396804#comment-13396804
] 

Omid Aladini commented on CASSANDRA-4321:
-----------------------------------------

Tried v4 patch and offline scrub went through completely. Cassandra started without any error
but compaction halted again:

{code}
2012-06-19_14:01:03.47432  INFO 14:01:03,474 Compacting [SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-67792-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-65607-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-63279-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-65491-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-68332-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-64720-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-65322-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-66557-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-64504-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-68179-Data.db'),
SSTableReader(path='/var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-hd-65005-Data.db')]
2012-06-19_14:01:08.73528 ERROR 14:01:08,733 Exception in thread Thread[CompactionExecutor:11,1,main]
2012-06-19_14:01:08.73538 java.lang.RuntimeException: Last written key DecoratedKey(42351003983459534782466386414991462257,
313632303432347c3130303632313432) >= current key DecoratedKey(38276735926421753773204634663641518108,
31343638373735327c3439343837333932) writing into /var/lib/cassandra/abcd/data/SOMEKSP/CF3/SOMEKSP-CF3-tmp-hd-68399-Data.db
2012-06-19_14:01:08.73572 	at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
2012-06-19_14:01:08.73581 	at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
2012-06-19_14:01:08.73590 	at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
2012-06-19_14:01:08.73600 	at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
2012-06-19_14:01:08.73611 	at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:150)
2012-06-19_14:01:08.73622 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
2012-06-19_14:01:08.73633 	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
Source)
2012-06-19_14:01:08.73642 	at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
2012-06-19_14:01:08.73650 	at java.util.concurrent.FutureTask.run(Unknown Source)
2012-06-19_14:01:08.73657 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
Source)
2012-06-19_14:01:08.73665 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
2012-06-19_14:01:08.73672 	at java.lang.Thread.run(Unknown Source)
{code}

All SSTables that participated in compaction were new ones written by the offline scrub (according
their timestamp and also id range.) although the first one didn't exist any more (already
promoted before the exception?)

{quote}This is not really a new bug, but I believe that prior to CASSANDRA-4142, *this had
less consequences*.{quote}

Sylvain, could you please elaborate on this? I'd like to know how pre-1.1.1 data is affected
by the Range-vs-Bounds bug. Only overlapping/duplicate sstables on the same level leading
to slower reads caused by unneeded sstable lookups?

                
> stackoverflow building interval tree & possible sstable corruptions
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-4321
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4321
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.1
>            Reporter: Anton Winter
>            Assignee: Sylvain Lebresne
>             Fix For: 1.1.2
>
>         Attachments: 0001-Change-Range-Bounds-in-LeveledManifest.overlapping-v3.txt,
0002-Scrub-detects-and-repair-outOfOrder-rows-v3.txt, 0003-Create-standalone-scrub-v3.txt,
0003-Create-standalone-scrub-v4.txt, ooyala-hastur-stacktrace.txt
>
>
> After upgrading to 1.1.1 (from 1.1.0) I have started experiencing StackOverflowError's
resulting in compaction backlog and failure to restart. 
> The ring currently consists of 6 DC's and 22 nodes using LCS & compression.  This
issue was first noted on 2 nodes in one DC and then appears to have spread to various other
nodes in the other DC's.  
> When the first occurrence of this was found I restarted the instance but it failed to
start so I cleared its data and treated it as a replacement node for the token it was previously
responsible for.  This node successfully streamed all the relevant data back but failed again
a number of hours later with the same StackOverflowError and again was unable to restart.

> The initial stack overflow error on a running instance looks like this:
> ERROR [CompactionExecutor:314] 2012-06-07 09:59:43,017 AbstractCassandraDaemon.java (line
134) Exception in thread Thread[CompactionExecutor:314,1,main]
> java.lang.StackOverflowError
>         at java.util.Arrays.mergeSort(Arrays.java:1157)
>         at java.util.Arrays.sort(Arrays.java:1092)
>         at java.util.Collections.sort(Collections.java:134)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.findMinMedianMax(IntervalNode.java:114)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:49)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
> [snip - this repeats until stack overflow.  Compactions stop from this point onwards]
> I restarted this failing instance with DEBUG logging enabled and it throws the following
exception part way through startup:
> ERROR 11:37:51,046 Exception in thread Thread[OptionalTasks:1,5,main]
> java.lang.StackOverflowError
>         at org.slf4j.helpers.MessageFormatter.safeObjectAppend(MessageFormatter.java:307)
>         at org.slf4j.helpers.MessageFormatter.deeplyAppendParameter(MessageFormatter.java:276)
>         at org.slf4j.helpers.MessageFormatter.arrayFormat(MessageFormatter.java:230)
>         at org.slf4j.helpers.MessageFormatter.format(MessageFormatter.java:124)
>         at org.slf4j.impl.Log4jLoggerAdapter.debug(Log4jLoggerAdapter.java:228)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:45)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
> [snip - this repeats until stack overflow]
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:64)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:64)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:64)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalTree.<init>(IntervalTree.java:39)
>         at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:560)
>         at org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:617)
>         at org.apache.cassandra.db.DataTracker.replace(DataTracker.java:320)
>         at org.apache.cassandra.db.DataTracker.addInitialSSTables(DataTracker.java:259)
>         at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:234)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:331)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:309)
>         at org.apache.cassandra.db.Table.initCf(Table.java:367)
>         at org.apache.cassandra.db.Table.<init>(Table.java:299)
>         at org.apache.cassandra.db.Table.open(Table.java:114)
>         at org.apache.cassandra.db.Table.open(Table.java:97)
>         at org.apache.cassandra.db.Table$2.apply(Table.java:574)
>         at org.apache.cassandra.db.Table$2.apply(Table.java:571)
>         at com.google.common.collect.Iterators$8.next(Iterators.java:751)
>         at org.apache.cassandra.db.ColumnFamilyStore.all(ColumnFamilyStore.java:1625)
>         at org.apache.cassandra.db.MeteredFlusher.countFlushingBytes(MeteredFlusher.java:118)
>         at org.apache.cassandra.db.MeteredFlusher.run(MeteredFlusher.java:45)
>         at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:79)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:636)
> DEBUG 11:37:51,052 Initializing ksU.cfS
> And then finally fails with the following:
> DEBUG 11:49:03,752 Creating IntervalNode from [Interval(DecoratedKey(104860264640932324846851821824650966808,
4fcc88eb0218216164673394), DecoratedKey(93975306025956344620001177071135439009, 4fc8fb042c98458c7a58bc3b)),
Interval(DecoratedKey(104860264640932324846851821824650966808, 4fcc88eb0218216164673394),
DecoratedKey(93975306025956344620001177071135439009, 4fc8fb042c98458c7a58bc3b)), Interval(DecoratedKey(104860264640932324846851821824650966808,
4fcc88eb0218216164673394), DecoratedKey(93975306025956344620001177071135439009, 4fc8fb042c98458c7a58bc3b)),
Interval(DecoratedKey(104860264640932324846851821824650966808, 4fcc88eb0218216164673394),
DecoratedKey(93975306025956344620001177071135439009, 4fc8fb042c98458c7a58bc3b)), Interval(DecoratedKey(104860264640932324846851821824650966808,
4fcc88eb0218216164673394), DecoratedKey(93975306025956344620001177071135439009, 4fc8fb042c98458c7a58bc3b)),
Interval(DecoratedKey(104860264640932324846851821824650966808, 4fcc88eb0218216164673394),
DecoratedKey(93975306025956344620001177071135439009, 4fc8fb042c98458c7a58bc3b))]
> java.lang.reflect.InvocationTargetException
> DEBUG 11:49:03,753 Configured datacenter replicas are dc1:2, dc2:2, dc3:2, dc4:2, dc5:0,
dc6:2, dc7:0, dc8:0, dc9:2
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:616)
>         at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:160)
> Caused by: java.lang.StackOverflowError
>         at org.slf4j.helpers.MessageFormatter.safeObjectAppend(MessageFormatter.java:307)
>         at org.slf4j.helpers.MessageFormatter.deeplyAppendParameter(MessageFormatter.java:276)
>         at org.slf4j.helpers.MessageFormatter.arrayFormat(MessageFormatter.java:230)
>         at org.slf4j.helpers.MessageFormatter.format(MessageFormatter.java:124)
>         at org.slf4j.impl.Log4jLoggerAdapter.debug(Log4jLoggerAdapter.java:228)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:45)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
> [snip - this repeats until stack overflow]
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:64)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:64)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:64)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalNode.<init>(IntervalNode.java:62)
>         at org.apache.cassandra.utils.IntervalTree.IntervalTree.<init>(IntervalTree.java:39)
>         at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:560)
>         at org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:617)
>         at org.apache.cassandra.db.DataTracker.replace(DataTracker.java:320)
>         at org.apache.cassandra.db.DataTracker.addInitialSSTables(DataTracker.java:259)
>         at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:234)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:331)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:309)
>         at org.apache.cassandra.db.Table.initCf(Table.java:367)
>         at org.apache.cassandra.db.Table.<init>(Table.java:299)
>         at org.apache.cassandra.db.Table.open(Table.java:114)
>         at org.apache.cassandra.db.Table.open(Table.java:97)
>         at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:204)
>         at org.apache.cassandra.service.AbstractCassandraDaemon.init(AbstractCassandraDaemon.java:254)
>         ... 5 more
> Cannot load daemon
> Service exit with a return value of 3
> Running with assertions enabled allows me to start the instance but when doing so I get
errors such as:
> ERROR 01:22:22,753 Exception in thread Thread[SSTableBatchOpen:2,5,main]java.lang.AssertionError:
SSTable first key DecoratedKey(100294972947100949193477090306072672386, 4fcf051ef5067d7f17d9fc35)
> last key DecoratedKey(90250429663386465697464050082134975058, 4fce996e3c1eed8c4b17dd66)
> at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:412)
> at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:187)
> at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:225)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:636)
> and:
> ERROR 01:27:58,946 Exception in thread Thread[CompactionExecutor:9,1,main]
> java.lang.AssertionError: Last written key DecoratedKey(81958437188197992567937826278457419048,
4fa1aebad23f81e4321d344d) >= current key DecoratedKey(64546479828744423263742604083767363606,
4fcafc0f19f6a8092d4d4f94) writing into /var/lib/XX/data/cassandra/ks1/cf1/ks1-cf1-tmp-hd-657317-Data.db
>         at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
>         at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
>         at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
>         at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
>         at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:150)
>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:636)
> Just like the initial errors compactions appear to stop occurring after this point. 

> Given the above this looks like sstables are getting corrupted.  By restarting nodes
I am able to identify several hundred sstables exhibiting the same problem and this appears
to be growing.
> I have tried scrubbing those affected nodes but the problem continues to occur.  If this
is due to sstable corruptions is there another way of validating sstables for correctness?
 Given that it has spread to various servers in other DC's it looks like this is directly
related to the 1.1.1 upgrade recently performed on the ring.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

Mime
View raw message