From: "Dan Hendry"
Subject: RE: frequent client exceptions on 0.7.0
Date: Thu, 17 Feb 2011 10:43:48 -0500

Try turning on GC logging in cassandra-env.sh, specifically:

    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:/var/log/cassandra/gc.log

Look for things like: "Total time for which application threads were stopped: 52.8795600 seconds". Anything over a few seconds may be causing your problem. Stop-the-world GC is a real pain. In my cluster I was, and to some extent still am, seeing each node go 'down' about 10-30 times a day, and up to a few hundred times a day when running major compactions (counted by grepping through the Cassandra system log).

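The GC log check described above can be sketched in a few lines of Python. This is only an illustration: the function name long_pauses and the 2-second default threshold are hypothetical choices, not anything provided by Cassandra or the JVM, and the regex assumes the exact message wording quoted above.

```python
import re

# Matches the line written when -XX:+PrintGCApplicationStoppedTime is enabled.
# Assumption: the wording is exactly as quoted in the message above.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)


def long_pauses(lines, threshold_seconds=2.0):
    """Return the pause durations (in seconds) that exceed the threshold.

    The 2.0 default is an illustrative stand-in for "a few seconds".
    """
    pauses = []
    for line in lines:
        match = STOPPED.search(line)
        if match:
            seconds = float(match.group(1))
            if seconds > threshold_seconds:
                pauses.append(seconds)
    return pauses


# Example use against the log path from the -Xloggc setting:
#   with open("/var/log/cassandra/gc.log") as log:
#       for seconds in long_pauses(log):
#           print("%.3f s pause" % seconds)
```

Comparing the timestamps of the long pauses with the times at which a node was reported 'down' should show whether GC is the likely cause.
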
GC tuning is an art unto itself, but if this is your problem, try:

- lowering memtable flush thresholds
- reducing new gen size (the -Xmn setting, which is explicitly set in 0.7.1+)
- reducing CMSInitiatingOccupancyFraction from 75 to 60 or so (maybe less)
- setting -XX:ParallelGCThreads=
- setting -XX:ParallelCMSThreads=

Again, I would recommend you do some more research into GC tuning (http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html is a good place to start). Most of my recommendations above will probably reduce the chance of your nodes going 'down' but may have pretty severe negative performance impacts. In my cluster, I found that the measures needed to ensure a node never went down (or rarely; it can't be completely prevented) just were not worth it. I have ended up running the nodes closer to the wire and living with an increased rate of client-side exceptions and nodes going down for short periods.

Dan

-----Original Message-----
From: Andy Skalet [mailto:aeskalet@bitjug.com]
Sent: February-17-11 4:18
To: Peter Schuller
Cc: user@cassandra.apache.org
Subject: Re: frequent client exceptions on 0.7.0

On Thu, Feb 17, 2011 at 12:37 AM, Peter Schuller wrote:
> Bottom line: Check /var/log/cassandra/system.log to begin with and see
> if it's reporting anything or being restarted.

Thanks, Peter. In the system.log, I see quite a few of these across several machines. Everything else in the log is INFO level.

WARN [ScheduledTasks:1] 2011-02-17 07:19:47,491 MessagingService.java (line 545) Dropped 182 READ messages in the last 5000ms
WARN [ScheduledTasks:1] 2011-02-17 08:10:06,142 MessagingService.java (line 545) Dropped 31 READ messages in the last 5000ms
WARN [ScheduledTasks:1] 2011-02-17 08:11:12,237 MessagingService.java (line 545) Dropped 54 READ messages in the last 5000ms
WARN [ScheduledTasks:1] 2011-02-17 08:11:17,392 MessagingService.java (line 545) Dropped 487 READ messages in the last 5000ms

The machines are in EC2 with firewall permission to talk to each other, so while not the most solid of network environments, at least pretty common these days. System is not going down, and cassandra process is not dying.

Andy
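
To see how often a node drops messages, the WARN lines quoted above can be tallied with a short script. This is a sketch under stated assumptions: the function name dropped_totals is hypothetical, and the regex assumes the "Dropped N TYPE messages in the last Xms" wording shown in the log excerpt, with a one-word message type (only READ appears in the excerpt).

```python
import re

# Assumption: the warning wording matches the system.log excerpt quoted above.
DROPPED = re.compile(r"Dropped (\d+) (\w+) messages in the last (\d+)ms")


def dropped_totals(lines):
    """Sum the dropped-message counts per message type (e.g. READ)."""
    totals = {}
    for line in lines:
        match = DROPPED.search(line)
        if match:
            count = int(match.group(1))
            message_type = match.group(2)
            totals[message_type] = totals.get(message_type, 0) + count
    return totals


# Example use against the log file mentioned in the thread:
#   with open("/var/log/cassandra/system.log") as log:
#       print(dropped_totals(log))
```

Running the same tally on each machine makes it easier to compare nodes and to line up bursts of dropped reads with any long GC pauses.
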