cassandra-commits mailing list archives

From "Jose Fernandez (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11209) SSTable ancestor leaked reference
Date Tue, 23 Feb 2016 16:10:19 GMT


Jose Fernandez commented on CASSANDRA-11209:

This is the error we're seeing:

ERROR 22:08:05 Cannot start multiple repair sessions over the same sstables
ERROR 22:08:05 Failed creating a merkle tree for [repair #a85c9760-d9b0-11e5-9b9c-c12de94ec9ee
on timeslice_store/minute_timeslice_blobs, (7686143364045646505,-6148914691236517207]], /
(see log for details)
ERROR 22:08:05 Exception in thread Thread[ValidationExecutor:8,1,main]
java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
	at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(
	at org.apache.cassandra.db.compaction.CompactionManager.access$600(
	at org.apache.cassandra.db.compaction.CompactionManager$
	at ~[na:1.8.0_66]
	at java.util.concurrent.ThreadPoolExecutor.runWorker( ~[na:1.8.0_66]
	at java.util.concurrent.ThreadPoolExecutor$ [na:1.8.0_66]
	at [na:1.8.0_66]

> SSTable ancestor leaked reference
> ---------------------------------
>                 Key: CASSANDRA-11209
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>            Reporter: Jose Fernandez
>         Attachments: screenshot-1.png, screenshot-2.png
> We're running a fork of 2.1.13 that adds the TimeWindowCompactionStrategy from [~jjirsa].
> We've been running 4 clusters without any issues for many months, until a few weeks ago we
> started scheduling incremental repairs every 24 hours (previously we didn't run any repairs
> at all).
> Since then we started noticing big discrepancies in the LiveDiskSpaceUsed, TotalDiskSpaceUsed,
> and actual size of files on disk. The numbers are brought back in sync by restarting the node.
> We also noticed that when this bug happens there are several ancestors that don't get cleaned
> up. A restart will queue up a lot of compactions that slowly eat away the ancestors.
> I looked at the code and noticed that we only decrease the LiveDiskSpaceUsed metric in
> the SSTableDeletingTask. Since we have no errors being logged, I'm assuming that for some
> reason this task is not getting queued up. If I understand correctly, this only happens when
> the reference count for the SSTable reaches 0. So this is leading us to believe that something
> during repairs and/or compactions is causing a reference leak to the ancestor table.
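The leak hypothesis above can be sketched with a simplified reference-counting model. This is illustrative only, not Cassandra's actual implementation: the names `Ref`, `LeakDemo`, and the `onFullyReleased` callback are hypothetical stand-ins for the real SSTable reference machinery and SSTableDeletingTask. The point is that the cleanup task (which also adjusts the disk-space metric) fires only when the count hits zero, so a single missed release leaks the ancestor forever.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified reference counter: the cleanup action runs
// exactly once, when the count drops to zero (analogous to how
// SSTableDeletingTask is only queued at refcount 0).
class Ref {
    private final AtomicInteger count = new AtomicInteger(1); // creator holds one reference
    private final Runnable onFullyReleased;

    Ref(Runnable onFullyReleased) { this.onFullyReleased = onFullyReleased; }

    void acquire() { count.incrementAndGet(); }

    void release() {
        if (count.decrementAndGet() == 0)
            onFullyReleased.run(); // e.g. delete the file, decrement LiveDiskSpaceUsed
    }

    int refCount() { return count.get(); }
}

public class LeakDemo {
    public static void main(String[] args) {
        final boolean[] deleted = {false};
        Ref ref = new Ref(() -> deleted[0] = true);

        ref.acquire(); // e.g. a validation/repair path takes a reference
        ref.release(); // the original owner releases its reference

        // If the repair path never calls release(), the count stays at 1:
        // cleanup never runs and metrics drift from actual disk usage.
        System.out.println(deleted[0] + " refs=" + ref.refCount());

        ref.release(); // the "missing" release: only now does cleanup fire
        System.out.println(deleted[0] + " refs=" + ref.refCount());
    }
}
```

Under this model, restarting the node is what breaks the cycle: the leaked in-memory reference disappears, and the queued compactions on startup gradually remove the orphaned ancestors, matching the behavior described above.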

This message was sent by Atlassian JIRA
