Date: Tue, 23 Feb 2016 16:06:18 +0000 (UTC)
From: "Jose Fernandez (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-11209) SSTable ancestor leaked reference

    [ https://issues.apache.org/jira/browse/CASSANDRA-11209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159096#comment-15159096 ]

Jose Fernandez commented on CASSANDRA-11209:
--------------------------------------------

Actually, I just spotted an error during repair:

ERROR 22:08:05 [repair #a85c9760-d9b0-11e5-9b9c-c12de94ec9ee] session completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair #a85c9760-d9b0-11e5-9b9c-c12de94ec9ee on timeslice_store/minute_timeslice_blobs, (7686143364045646505,-6148914691236517207]] Validation failed in /10.1.29.31
    at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:166) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:415) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:134) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_66]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66]
ERROR 22:08:05 Repair session a85c9760-d9b0-11e5-9b9c-c12de94ec9ee for range (7686143364045646505,-6148914691236517207] failed with error org.apache.cassandra.exceptions.RepairException: [repair #a85c9760-d9b0-11e5-9b9c-c12de94ec9ee on timeslice_store/minute_timeslice_blobs, (7686143364045646505,-6148914691236517207]] Validation failed in /10.1.29.31
java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #a85c9760-d9b0-11e5-9b9c-c12de94ec9ee on timeslice_store/minute_timeslice_blobs, (7686143364045646505,-6148914691236517207]] Validation failed in /10.1.29.31
    at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.8.0_66]
    at java.util.concurrent.FutureTask.get(FutureTask.java:192) [na:1.8.0_66]
    at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:3048) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [apache-cassandra-2.1.13.jar:2.1.13]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_66]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_66]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #a85c9760-d9b0-11e5-9b9c-c12de94ec9ee on timeslice_store/minute_timeslice_blobs, (7686143364045646505,-6148914691236517207]] Validation failed in /10.1.29.31
    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) [apache-cassandra-2.1.13.jar:2.1.13]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_66]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_66]
    ... 1 common frames omitted
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #a85c9760-d9b0-11e5-9b9c-c12de94ec9ee on timeslice_store/minute_timeslice_blobs, (7686143364045646505,-6148914691236517207]] Validation failed in /10.1.29.31
    at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:166) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:415) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:134) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-2.1.13.jar:2.1.13]
    ... 3 common frames omitted
ERROR 22:08:05 Exception in thread Thread[AntiEntropySessions:1,5,jolokia]
java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #a85c9760-d9b0-11e5-9b9c-c12de94ec9ee on timeslice_store/minute_timeslice_blobs, (7686143364045646505,-6148914691236517207]] Validation failed in /10.1.29.31
    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_66]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_66]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66]
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #a85c9760-d9b0-11e5-9b9c-c12de94ec9ee on timeslice_store/minute_timeslice_blobs, (7686143364045646505,-6148914691236517207]] Validation failed in /10.1.29.31
    at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:166) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:415) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:134) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) ~[apache-cassandra-2.1.13.jar:2.1.13]
    ... 3 common frames omitted
> SSTable ancestor leaked reference
> ---------------------------------
>
>                 Key: CASSANDRA-11209
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11209
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>            Reporter: Jose Fernandez
>        Attachments: screenshot-1.png, screenshot-2.png
>
>
> We're running a fork of 2.1.13 that adds the TimeWindowCompactionStrategy from [~jjirsa]. We had been running 4 clusters without any issues for many months until, a few weeks ago, we started scheduling incremental repairs every 24 hours (previously we didn't run any repairs at all).
> Since then we have noticed big discrepancies between the LiveDiskSpaceUsed and TotalDiskSpaceUsed metrics and the actual size of the files on disk. The numbers are brought back in sync by restarting the node. We also noticed that when this bug happens, several ancestor SSTables don't get cleaned up. A restart queues up a lot of compactions that slowly eat away the ancestors.
> I looked at the code and noticed that we only decrease the LiveDiskSpaceUsed metric in the SSTableDeletingTask. Since no errors are being logged, I'm assuming that for some reason this task is not getting queued up. If I understand correctly, this only happens when the reference count for the SSTable reaches 0. So this leads us to believe that something during repairs and/or compactions is leaking a reference to the ancestor table.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
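To illustrate the suspected mechanism: cleanup is gated on a reference count, so a single holder that never releases is enough to keep the deleting task (and the metric decrement) from ever running, while the files and the stale metric linger until a restart. This is only a minimal sketch of that pattern with hypothetical names (`SSTableRefSketch`, `liveDiskSpaceUsed`), not Cassandra's actual SSTableReader/SSTableDeletingTask code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of ref-counted SSTable cleanup: the "deleting task"
// runs, and the live-disk-space metric is decremented, only when the
// reference count drops to zero. A leaked reference (e.g. from a failed
// repair validation) keeps the count above zero forever, so neither happens.
class SSTableRefSketch {
    // Stand-in for the node-wide LiveDiskSpaceUsed metric (starts at 1000).
    static final AtomicInteger liveDiskSpaceUsed = new AtomicInteger(1000);

    final AtomicInteger refs = new AtomicInteger(1); // 1 = the "self" reference
    final int sizeOnDisk;
    boolean deletingTaskRan = false;

    SSTableRefSketch(int sizeOnDisk) {
        this.sizeOnDisk = sizeOnDisk;
    }

    // A compaction or validation takes a reference before reading the table.
    void ref() {
        refs.incrementAndGet();
    }

    // Releasing the final reference is the ONLY path that triggers cleanup.
    void unref() {
        if (refs.decrementAndGet() == 0) {
            deletingTaskRan = true;                   // files would be deleted here
            liveDiskSpaceUsed.addAndGet(-sizeOnDisk); // metric decremented here
        }
    }
}
```

If a holder takes a `ref()` but never calls `unref()`, the count stays above zero, `deletingTaskRan` stays false, and `liveDiskSpaceUsed` never shrinks, which matches the symptom of metrics drifting from the real on-disk size until a restart resets everything.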