Message-ID: <4CD5CDFC.6020900@gmail.com>
Date: Sat, 06 Nov 2010 14:51:56 -0700
From: Reverend Chip
To: user@cassandra.apache.org
Subject: Re: node won't leave

On 11/6/2010 1:48 PM, Jonathan Ellis wrote:
> On Fri, Nov 5, 2010 at 8:03 PM, Chip Salzenberg wrote:
>> In the below "nodetool ring" output, machine 18 was told to loadbalance over
>> an hour ago. It won't actually leave the ring. When I first told it to
>> loadbalance, the cluster was under heavy write load; I've turned off the
>> write load, but the node won't actually leave, still. Help?
> What version is the cluster on?

You mean, the Cassandra version? 0.7 beta3.

> Did any of the nodes log any dropped messages?

I didn't keep timestamps of the maintenance steps, so I will be unable to be sure which log entries correspond to which failure states. I did find dropped message log entries on node X.22, though.
Here's the batch that happened more or less the time things went wrong:

WARN [ScheduledTasks:1] 2010-11-05 17:15:03,294 MessagingService.java (line 515) Dropped 9122 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:05,434 MessagingService.java (line 515) Dropped 16658 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:07,084 MessagingService.java (line 515) Dropped 2167 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:09,371 MessagingService.java (line 515) Dropped 28011 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:11,111 MessagingService.java (line 515) Dropped 1139 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:13,330 MessagingService.java (line 515) Dropped 1203 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:15,241 MessagingService.java (line 515) Dropped 4494 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:16,925 MessagingService.java (line 515) Dropped 2277 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:18,839 MessagingService.java (line 515) Dropped 17376 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:23,385 MessagingService.java (line 515) Dropped 18714 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:25,261 MessagingService.java (line 515) Dropped 18952 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:29,006 MessagingService.java (line 515) Dropped 25137 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:30,859 MessagingService.java (line 515) Dropped 1 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:34,418 MessagingService.java (line 515) Dropped 2580 messages in the last 1000ms
WARN [ScheduledTasks:1] 2010-11-05 17:15:35,816 MessagingService.java (line 515) Dropped 4317 messages in the last 1000ms

I looked for similar messages on node X.21 but didn't find any.

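For anyone who wants to check their own nodes for the same pattern, here is a rough Python sketch that pulls the "Dropped N messages" warnings out of a log and totals them. It assumes the line format is exactly what is quoted above; the function name `summarize_dropped` and the sample list are made up for illustration and are not part of Cassandra.

```python
import re

# Assumption: warnings look exactly like the MessagingService lines quoted above.
DROPPED_RE = re.compile(
    r"WARN \[(?P<thread>[^\]]+)\] "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"MessagingService\.java \(line \d+\) "
    r"Dropped (?P<count>\d+) messages in the last (?P<window>\d+)ms"
)


def summarize_dropped(lines):
    """Return (entries, total) for dropped-message warnings in an iterable of log lines.

    entries is a list of (timestamp_string, dropped_count) tuples in input order.
    Hypothetical helper, not a Cassandra API.
    """
    entries = []
    for line in lines:
        match = DROPPED_RE.search(line)
        if match:
            entries.append((match.group("ts"), int(match.group("count"))))
    total = sum(count for _, count in entries)
    return entries, total


# Three of the lines from the batch above, used as sample input.
sample = [
    "WARN [ScheduledTasks:1] 2010-11-05 17:15:03,294 MessagingService.java (line 515) Dropped 9122 messages in the last 1000ms",
    "WARN [ScheduledTasks:1] 2010-11-05 17:15:05,434 MessagingService.java (line 515) Dropped 16658 messages in the last 1000ms",
    "WARN [ScheduledTasks:1] 2010-11-05 17:15:30,859 MessagingService.java (line 515) Dropped 1 messages in the last 1000ms",
]
entries, total = summarize_dropped(sample)
print(len(entries), total)
```

To run it against a real log, pass an open file object instead of `sample`; lines that do not match the pattern are simply skipped.
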
It seems that node states can become weird or wedged -- bordering on internally inconsistent -- and cleanup operations on the order of "shutdown the node manually and force-remove it from the ring" are commonplace. I hope I'm missing something. Am I to understand that ring maintenance requests can just fail when partially complete, in the same manner as a regular insert might fail, perhaps due to inter-node RPC overflow?

> Any other error or warning messages?

"Cannot provide an optimal BloomFilter" several times, and "Schema definitions were defined both locally and in cassandra.yaml" on startup.

>> (It also collected 3.6G of load even though automatic bootstrapping is
>> disabled -- but this node had belonged to the cluster before, so maybe
>> cleaning out /var/lib/cassandra/* wasn't enough to prevent the node from
>> rejoining and taking data responsibility?)
> Assuming that contains both commitlog and data directories, that
> should do it. You can tell by what it logs when it first starts up,
> if it's asking other nodes to send it data.

It would appear, then, that Cassandra isn't designed to be operated and understood without constant log watching of all nodes.