Message-ID: <517EAEDD.9020603@zhrodague.net>
Date: Mon, 29 Apr 2013 13:33:17 -0400
From: Drew from Zhrodague
Reply-To: drew@zhrodague.net
To: user@cassandra.apache.org
Subject: Compaction, Slow Ring, and bad behavior

Hi,

We have a 9-node ring on m1.xlarge AWS hosts. We started having some trouble a while ago, and it's making me pull out all of my hair.

The host in position #3 has been replaced 4 times. Each time the host joins the ring, I run nodetool repair -pr, and she seems fine for about a day. Then she gets real slow, sometimes OOMs, sometimes takes down the host in position #5, sometimes gets stuck on a compaction with near-idle disk throughput, and eventually dies without any kind of error message or reason for failing.

Sometimes our cluster gets so slow that it is almost unusable: we get timeout errors from our application, and AWS sends us voluminous alerts about latency.

I've tried changing the amount of RAM between 8G and 12G, changing MAX_HEAP_SIZE and HEAP_NEWSIZE, repeatedly forcing a stop compaction, setting astronomical ulimit values, and praying to available gods.

I'm a bit confused. We're not using super-wide rows, and most things are default.

EL5, Cassandra 1.1.9, Java 1.6.0

-- 
Drew from Zhrodague
lolcat divinator
drew@zhrodague.net