Date: Wed, 21 Apr 2010 10:45:47 -0700
From: Anthony Molinaro
To: user@cassandra.apache.org
Subject: Re: Cassandra 0.5.1 restarts slow

On Wed, Apr 21, 2010 at 12:21:31PM -0500, Jonathan Ellis wrote:
> [moving to user@]
>
> 0.6 fixes replaying faster than it can flush.

Yeah, I noticed some of those fixes, and will probably take the leap into
0.6 if I can keep my cluster running (it's not doing too badly: I do about
400K reads and 250K writes per minute spread over 23 nodes). However, some
of the m1.large instances frequently get into this backed-up state, so I
need to keep the cluster running first.

> as for why it backs up in the first place before the restart, you can
> either (a) throttle writes [set your timeout lower, make your clients
> back off temporarily when it gets a timeoutexception]

What timeout is this? Something in the Thrift API or a Cassandra
configuration setting?

> or (b) add capacity. (b) is recommended.

Yeah, I've been doing that, adding xlarge instances with RAID0 disks, which
work better, but I keep running into issues with the old instances that
hold up this work. I'll keep chugging along and hopefully get things
sorted.

-Anthony

>
> https://issues.apache.org/jira/browse/CASSANDRA-685 will mitigate this
> but there is still no substitute for adding capacity to match demand.
>
> On Tue, Apr 20, 2010 at 4:57 PM, Anthony Molinaro
> wrote:
> > Hi,
> >
> >   I have a cassandra cluster where a couple things are happening.  Every
> > once in a while a node will start to get backed up.  Checking tpstats I
> > see a very large value for ROW-MUTATION-STAGE.  Sometimes it will be able
> > to clear it if I give it enough time, other times the vm OOMs.  With some
> > nodes I also see this happen during restarts, I'll restart and have to
> > wait 6-12 hours for the node to not be marked as 'Down'.
> > I've seen
> > http://wiki.apache.org/cassandra/FAQ#slows_down_after_lotso_inserts
> > and ended up with the following settings.
> >
> > KeysCachedFraction            : 0.01
> > MemtableSizeInMB              : 100
> > MemtableObjectCountInMillions : 0.5
> > Heap                          : -Xmx5G
> >
> > I only have 2 CFs in this instance and entries are small so in most cases
> > I hit MemtableObjectCountInMillions first and total MemtableSizeInMB is
> > about 60MB-120MB for the 2 CFs combined.
> >
> > Anyone have any pointers on where to look next?  These are m1.large EC2
> > instances (I want to move to xlarge to get more memory, but haven't yet
> > gotten clarification on the best process for node replacement, per my
> > other thread).
> >
> > Thanks,
> >
> > -Anthony
> >
> > --
> > ------------------------------------------------------------------------
> > Anthony Molinaro
> >

--
------------------------------------------------------------------------
Anthony Molinaro
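Jonathan's option (a), making clients back off temporarily when a write times out, can be sketched on the client side as capped exponential backoff with random jitter. This is a minimal sketch under stated assumptions, not Cassandra's API: `WriteTimeout` is a hypothetical stand-in for whatever timeout exception your Thrift client actually raises, and `with_backoff` and its parameters (`max_attempts`, `base_delay`, `max_delay`) are illustrative names with arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class WriteTimeout(Exception):
    """Hypothetical stand-in for the client's write-timeout exception."""


def with_backoff(
    op: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op(); on WriteTimeout, sleep with capped exponential backoff and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except WriteTimeout:
            if attempt == max_attempts - 1:
                raise  # give up; let the caller drop, queue, or slow the write
            # Backoff ceiling doubles each attempt (base, 2*base, 4*base, ...)
            # and is capped at max_delay; "full jitter" picks a random wait
            # below the ceiling so many clients don't retry in lockstep.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_insert() -> str:
        # Simulated write that times out twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise WriteTimeout()
        return "ok"

    delays: list = []
    print(with_backoff(flaky_insert, sleep=delays.append))  # prints "ok"
    print(len(delays))  # prints 2
```

Passing `sleep` as a parameter keeps the retry logic testable without real waits. Giving up after `max_attempts` hands the decision back to the caller, and the jitter spreads retries out instead of having every client hit an already backed-up node at the same moment.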