From: Evan Dandrea <evan@evalicious.com>
To: user@cassandra.apache.org
Date: Thu, 4 Jul 2013 11:20:19 +0100
Subject: going down from RF=3 to RF=2, repair constantly falls over with JVM OOM

Hi,

We've made the mistake of letting our nodes get too large; they now hold about 3TB each. We ran out of enough free space for a successful compaction, and because we're on 1.0.7, enabling compression to get out of the mess wasn't feasible.

We tried adding another node, but we think this may have put too much pressure on the existing nodes it was replicating from, so we backed out.

So we decided to drop RF from 3 down to 2 to relieve the disk pressure, and started building a secondary cluster with lots of 1TB nodes. We ran repair -pr on each node, but it's failing with a JVM OOM on one node while another node is streaming from it for the final repair.

Does anyone know what we can tune to get the cluster stable enough to put it in a multi-DC setup with the secondary cluster? Do we actually need to wait for these RF3->RF2 repairs to stabilize, or could we point it at the secondary cluster without risking data loss?

We've set the heap on the two problematic nodes to 20GB, up from the equally-too-high 12GB, but we're still hitting OOM.
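For reference, the RF change and the repairs were along these lines. On 1.0.x the replication factor lives in the keyspace definition, so it's changed through cassandra-cli; "OurKeyspace" is a placeholder, and the exact cli syntax may differ slightly between versions:

  # in cassandra-cli: drop the keyspace's replication factor to 2
  update keyspace OurKeyspace with strategy_options = {replication_factor:2};

  # then, on each node in turn, repair only that node's primary range
  nodetool -h <node> repair -pr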
I had seen in other threads that tuning down compaction might help, so we're trying the following:

  in_memory_compaction_limit_in_mb: 32 (down from 64)
  compaction_throughput_mb_per_sec: 8 (down from 16)
  concurrent_compactors: 2 (the nodes have 24 cores)
  flush_largest_memtables_at: 0.45 (down from 0.50)
  stream_throughput_outbound_megabits_per_sec: 300 (down from 400)
  reduce_cache_sizes_at: 0.5 (down from 0.6)
  reduce_cache_capacity_to: 0.35 (down from 0.4)
  -XX:CMSInitiatingOccupancyFraction=30

Here's the log from the most recent repair failure:
http://paste.ubuntu.com/5843017/

The OOM starts at line 13401.

Thanks for whatever insight you can provide.
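P.S. In case it's clearer in context: the first seven settings above live in cassandra.yaml, and the CMS flag goes through cassandra-env.sh. A sketch of our current values, not a recommendation:

  # cassandra.yaml (1.0.7)
  in_memory_compaction_limit_in_mb: 32
  compaction_throughput_mb_per_sec: 8
  concurrent_compactors: 2
  flush_largest_memtables_at: 0.45
  stream_throughput_outbound_megabits_per_sec: 300
  reduce_cache_sizes_at: 0.5
  reduce_cache_capacity_to: 0.35

  # cassandra-env.sh -- start CMS collections earlier
  JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=30"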