Date: Thu, 27 May 2010 10:03:05 -0700
From: Anthony Molinaro
To: user@cassandra.apache.org
Subject: Re: GMFD messages

On Thu, May 27, 2010 at 08:04:18AM -0600, Jonathan Ellis wrote:
> This is a relic of when Gossip was over UDP and had to worry about
> packet size. I created
> https://issues.apache.org/jira/browse/CASSANDRA-1138 to remove those
> notifications.

Ahh, okay. It's odd that a limit was set even with UDP: I send large UDP
packets all the time with LWES and don't have many issues. Glad to hear it
will be fixed (I may patch in a larger packet size locally as a short-term
workaround).

Looking at the code, it seems that if you hit either of these notifications
the message is not serialized (i.e. the serialize calls return false). Would
this explain why, if I restart a machine in the cluster in this state, it
only sees some of the ring?

In other words, maybe with a fresh restart of everything, some part of the
serialized message is small enough that all 27 machines fit in it. Once
they've been running for a little while, though, they start to creep over
the limit, then gossiping suddenly starts to fail because responses from
some nodes are never sent, and I start seeing inconsistency in the rings?

I think this hypothesis could be tested by just increasing the MAX size, so
I will try that.
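To make the hypothesis above concrete, here is a toy model in Python. It is
not Cassandra's serialization code: the function names, the 1400-byte cap and
the per-node state sizes are all assumptions chosen for illustration. It only
shows how a fixed byte cap could let a full 27-node ring through while
per-node state is small, then silently drop part of the ring once that state
grows.

```python
# Toy model only: every name and size here is hypothetical, not Cassandra's code.
MAX_PACKET_BYTES = 1400  # assumed cap, standing in for the old UDP-era limit


def serialize_states(states, max_bytes=MAX_PACKET_BYTES):
    """Pack endpoint states until the next one would exceed max_bytes.

    Returns (included_endpoints, complete). complete is False when the cap
    forced the serializer to stop early, loosely mirroring a serialize()
    call that returns false.
    """
    included = []
    used = 0
    for endpoint, payload in states.items():
        size = len(endpoint.encode()) + len(payload)
        if used + size > max_bytes:
            return included, False
        included.append(endpoint)
        used += size
    return included, True


def make_states(node_count, payload_bytes):
    """Build one fake state blob of payload_bytes for each node."""
    return {f"10.0.0.{i}": b"x" * payload_bytes for i in range(1, node_count + 1)}


# Fresh cluster: small per-node state, so the whole ring fits under the cap.
fresh, fresh_ok = serialize_states(make_states(27, 40))
# After running a while: per-node state has grown, so only part of the ring fits.
aged, aged_ok = serialize_states(make_states(27, 80))
# Raising the cap (the proposed test) lets the full ring through again.
raised, raised_ok = serialize_states(make_states(27, 80), max_bytes=4096)

print(len(fresh), fresh_ok)
print(len(aged), aged_ok)
print(len(raised), raised_ok)
```

If the real code behaves anything like this, raising the cap should bring the
missing nodes back into view, which is what the planned test would check.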
> I think the correlation with MessageDeserializer is a red herring.
> Gossip only happens once per second so I don't see how that could back
> MD up.

Yeah, I couldn't see how either; it was just that the 'Stopping
deserialization' message made me think it might (as only the nodes with a
backed-up MessageDeserializer had that message). Do gossip messages flow
through the MessageDeserializer?

Thanks for the response,

-Anthony

> On Tue, May 25, 2010 at 5:33 PM, Anthony Molinaro wrote:
> > Hi,
> >
> >  I just noticed I have lots of these messages
> >
> > INFO [GMFD:1] 2010-05-25 23:21:04,070 GossipDigestSynMessage.java (line 152)
> >  Remaining bytes zero. Stopping deserialization in EndPointState.
> > INFO [GMFD:1] 2010-05-25 23:21:05,224 GossipDigestSynMessage.java (line 129)
> >  @@@@ Breaking out to respect the MTU size in EPS. Estimate is 56 @@@@
> >
> > The first message only occurs on some machines in my cluster. The second
> > on all of them.
> >
> > The ones with the first message seem to be building up quite a backlog
> > in their MessageDeserializer PendingTasks.
> >
> > I assume there is a correlation, what could be causing this sort of thing?
> >
> > This cluster is now at 27 m1.xlarge boxes on ec2 running 0.6.2 of some flavor.
> >
> > Thanks,
> >
> > -Anthony
> >
> > --
> > ------------------------------------------------------------------------
> > Anthony Molinaro
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com

--
------------------------------------------------------------------------
Anthony Molinaro