cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-9021) AssertionError and Leak detected during sstable compaction
Date Tue, 24 Mar 2015 15:33:53 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378030#comment-14378030 ]

Benedict commented on CASSANDRA-9021:
-------------------------------------

As it happens, test-compression was already exhibiting the bug; we just weren't monitoring
it closely, since not all of the compression tests pass and it only runs under "test-all",
which we seem to still ignore. I would prefer to stick with this as the route to track the
bug, since it feels a bit too specific to warrant its own unit test (it's an upshot of the
code history that preceded it, and the orderings we'd be testing for are specific to the
current status quo - testing for a historical accident rather than a general functional
failure seems to induce false confidence), whereas the tests that already exhibit it are
more general functional tests that should catch any such misordering, not just this one.
It also wouldn't be quite so easy to reproduce as it stands, since triggering the problem
depended on the memory location being reallocated to corrupt the state. We could add a burn
test for sstable compaction, ensuring sstables behave correctly at all times for their
duration, roughly along the lines sketched below. I would be in favour of this, as it would
put all of these paths more thoroughly through their paces, but I think it warrants a
separate endeavour.
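
For concreteness, such a burn test might take roughly the following shape (a minimal
sketch only; writeAndCompact and verifyInvariants are hypothetical stand-ins for the real
sstable operations, not existing Cassandra APIs):

{code}
import java.util.Random;
import java.util.concurrent.TimeUnit;

// Minimal burn-test skeleton: run randomized compaction-like work against a
// wall-clock budget and check invariants on every iteration, rather than
// asserting one particular historical ordering of events.
public class CompactionBurnTest
{
    // Hypothetical stand-in: write, flush and compact some sstables.
    static void writeAndCompact(Random random) {}

    // Hypothetical stand-in: e.g. no leaked refs, readers see consistent data.
    static void verifyInvariants() {}

    public static void main(String[] args)
    {
        long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(5);
        Random random = new Random(42); // fixed seed so failures reproduce
        while (System.nanoTime() < deadline)
        {
            writeAndCompact(random);
            verifyInvariants(); // correct behaviour at all times for the duration
        }
    }
}
{code}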

There was an interesting question as to why this relied on memory corruption rather than
asserting more aggressively that the memory had been freed. I've tracked that down to
SafeMemory not actually zeroing out its peer, since the free is performed by its internal
Tidy. However, there is no reason not to zero the peer immediately on closing a SafeMemory,
since any access of a closed SafeMemory is a bug, whether or not the shared memory allocation
is still live. So I'm attaching another micro patch that changes this behaviour and prints
much more helpful error messages when running test-compression.
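
To illustrate the idea only (a minimal sketch, not the actual SafeMemory class, which frees
the shared native allocation through its internal Tidy):

{code}
// Sketch: zero the peer eagerly on close, so a use-after-close fails an
// assertion immediately instead of silently reading memory that may since
// have been reallocated. Not the real class; names are illustrative.
public class SafeMemorySketch
{
    private volatile long peer; // native address; 0 means closed

    SafeMemorySketch(long peer) { this.peer = peer; }

    public void close()
    {
        // Even if other references keep the shared allocation alive, any
        // further access through *this* handle is a bug, so drop the peer now.
        peer = 0;
    }

    long checkedOffset(long offset)
    {
        assert peer != 0 : "access to closed SafeMemory";
        return peer + offset;
    }

    public static void main(String[] args)
    {
        SafeMemorySketch memory = new SafeMemorySketch(0xDEADL); // dummy address
        memory.close();
        memory.checkedOffset(8); // with -ea: immediate, descriptive AssertionError
    }
}
{code}

Run with -ea, the access after close() fails fast with a clear message, which is exactly the
earlier detection the micro patch is after.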

> AssertionError and Leak detected during sstable compaction
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-9021
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9021
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cluster setup:
> - 20-node stress cluster, GCE n1-standard-2
> - 10-node receiver cluster ingesting data, GCE n1-standard-8 
> Platform:
> - Ubuntu 12.04 x86_64
> Versions:
> - DSE 4.7.0
> - Cassandra 2.1.3.304
> - Java 1.7.0_45
> DSE Configuration:
> - -Xms7540M
> - -Xmx7540M
> - -Xmn800M
> - -Ddse.system_cpu_cores=8 -Ddse.system_memory_in_mb=30161
> - -Dcassandra.config.loader=com.datastax.bdp.config.DseConfigurationLoader
> - -ea -javaagent:/usr/local/lib/dse/resources/cassandra/lib/jamm-0.3.0.jar
> - -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities
> - -XX:ThreadPriorityPolicy=42 -Xms7540M -Xmx7540M -Xmn800M
> - -XX:+HeapDumpOnOutOfMemoryError -Xss256k
> - -XX:StringTableSize=1000003 -XX:+UseParNewGC
> - -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
> - -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
> - -XX:CMSInitiatingOccupancyFraction=75
> - -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB
>            Reporter: Rocco Varela
>            Assignee: Benedict
>             Fix For: 2.1.4
>
>         Attachments: 9021-improvereporting.txt, 9021.txt, system.log
>
>
> After ~3 hours of data ingestion we see assertion errors and 'LEAK DETECTED' errors during what looks like sstable compaction.
> system.log snippets (full log attached):
> {code}
> ...
> INFO  [CompactionExecutor:12] 2015-03-23 02:45:51,770  CompactionTask.java:267 - Compacted 4 sstables to [/mnt/cass_data_disks/data1/requests_ks/timeline-9500fe40d0f611e495675d5ea01541b5/requests_ks-timeline-ka-185,].  65,916,594 bytes to 66,159,512 (~100% of original) in 26,554ms = 2.376087MB/s.  983 total partitions merged to 805.  Partition merge counts were {1:627, 2:178, }
> INFO  [CompactionExecutor:11] 2015-03-23 02:45:51,837  CompactionTask.java:267 - Compacted 4 sstables to [/mnt/cass_data_disks/data1/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-ka-119,].  426 bytes to 42 (~9% of original) in 82ms = 0.000488MB/s.  5 total partitions merged to 1.  Partition merge counts were {1:1, 2:2, }
> ERROR [NonPeriodicTasks:1] 2015-03-23 02:45:52,251  CassandraDaemon.java:167 - Exception in thread Thread[NonPeriodicTasks:1,5,main]
> java.lang.AssertionError: null
>  at org.apache.cassandra.io.compress.CompressionMetadata$Chunk.<init>(CompressionMetadata.java:438) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at org.apache.cassandra.io.compress.CompressionMetadata.chunkFor(CompressionMetadata.java:228) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.dropPageCache(CompressedPoolingSegmentedFile.java:80) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at org.apache.cassandra.io.sstable.SSTableReader$6.run(SSTableReader.java:923) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier$1.run(SSTableReader.java:2036) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_45]
>  at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_45]
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) ~[na:1.7.0_45]
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) ~[na:1.7.0_45]
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_45]
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
>  at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> ...
> INFO  [MemtableFlushWriter:50] 2015-03-23 02:47:29,465  Memtable.java:378 - Completed flushing /mnt/cass_data_disks/data1/requests_ks/timeline-9500fe40d0f611e495675d5ea01541b5/requests_ks-timeline-ka-188-Data.db (16311981 bytes) for commitlog position ReplayPosition(segmentId=1427071574495, position=4523631)
> ERROR [Reference-Reaper:1] 2015-03-23 02:47:33,987  Ref.java:181 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@2f59b10) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@1251424500:/mnt/cass_data_disks/data1/requests_ks/timeline-9500fe40d0f611e495675d5ea01541b5/requests_ks-timeline-ka-149 was not released before the reference was garbage collected
> INFO  [Service Thread] 2015-03-23 02:47:40,158  GCInspector.java:142 - ConcurrentMarkSweep GC in 12247ms.  CMS Old Gen: 5318987136 -> 457655168; CMS Perm Gen: 44731264 -> 44699160; Par Eden Space: 8597912 -> 418006664; Par Survivor Space: 71865728 -> 59679584
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
