Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 9C88C11F3C for ; Fri, 15 Aug 2014 04:46:20 +0000 (UTC) Received: (qmail 86884 invoked by uid 500); 15 Aug 2014 04:46:11 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 86842 invoked by uid 500); 15 Aug 2014 04:46:11 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 86832 invoked by uid 99); 15 Aug 2014 04:46:11 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 15 Aug 2014 04:46:11 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: unknown (nike.apache.org: error in processing during lookup of peter@librato.com) Received: from [209.85.219.52] (HELO mail-oa0-f52.google.com) (209.85.219.52) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 15 Aug 2014 04:45:44 +0000 Received: by mail-oa0-f52.google.com with SMTP id o6so1707752oag.25 for ; Thu, 14 Aug 2014 21:45:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=librato.com; s=google; h=mime-version:date:message-id:subject:from:to:content-type; bh=fwhF8KMw6bgyln1oLrHQMIUBV7FC4IwbrKY1ABUuFwU=; b=AO1sQDJYSgCUGDv0Bjq8OcvLpt0HX0K9fOZRTzUA4BKK9fzFxmWo7OfNxihGl9MPyp mqttEvtqc3Nl19cli12PYg7eJmFELEq4nC0yeFtY1r1eDhYXUbYmLapRhw6FDYyHZl/i 7sglXHAMlgwRS8LA/mzWvtRUGAkG5No1QkML0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:date:message-id:subject:from:to :content-type; bh=fwhF8KMw6bgyln1oLrHQMIUBV7FC4IwbrKY1ABUuFwU=; b=IMd1tgk9cQr1zcHw9q0RZuD4p5c8RkSfSL1Q8eILrTAPEPhXyin4W43/PthmKAMzWm Plw/UpIOjZj3AMySvVFHIP0ABpMqPSr3hkeIbgQXig+dnpjkz2m0QQnppBiE/qprbvgx AFPKOddyRpNhyxoEE+qm+j2KgJHq+6+qmugLevlotoQE+YDVuRQmd4DRBMTRDR/umwTj OEUNZOuQOpRFdgVLBLxmoZdx5K4z9+v5iuba8gaK50Bwc8yor+lk78ohHMh/b1tHq8Zr Z37mP8XYKS8D6lqPa1OeVVbTuRM0WIqaxDsnxV7mN8Ic5UQFzVLfgQ1yb20Jgw/T6hK9 x52A== X-Gm-Message-State: ALoCoQml6Sl55tcCq9DQyxFm59keKjkH3vVNPKdC99NJWS7RuF9Vq8UriH4NOyZSLAG/R9wQtD1I MIME-Version: 1.0 X-Received: by 10.60.115.133 with SMTP id jo5mr17863438oeb.39.1408077942074; Thu, 14 Aug 2014 21:45:42 -0700 (PDT) Received: by 10.76.88.39 with HTTP; Thu, 14 Aug 2014 21:45:42 -0700 (PDT) X-Originating-IP: [108.48.185.229] Date: Fri, 15 Aug 2014 00:45:42 -0400 Message-ID: Subject: Bootstrap failures: unable to find sufficient sources for streaming range From: Peter Haggerty To: user@cassandra.apache.org Content-Type: text/plain; charset=UTF-8 X-Virus-Checked: Checked by ClamAV on apache.org When adding nodes via bootstrap to a 27 node 2.0.9 cluster with a cluster-wide phi_convict_threshold of 12 the nodes fail to bootstrap. This worked a half dozen times in the past few weeks as we've scaled this cluster from 21 to 24 and then to 27 nodes. There have been no configuration or Cassandra version changes since the cluster was booted several weeks ago. This is adding a single node at a time just as we've done before. When using a phi of 16 on the node that is bootstrapping we've had one failure and one success. Adjusting RING_DELAY from 30 seconds to 45 doesn't appear to help. We have DEBUG level logs for a successful attempt and a failed attempt but they seem quite similar in behavior, even bad behavior like ERRORS reported by MigrationTask.java and FailureDetector.java. In the failed attempts the bootstrap immediately falls over with "unable to find ...": java.lang.IllegalStateException: unable to find sufficient sources for streaming range (1445230693272554509,1449885986247309687] The logs clearly show that it knows where those ranges can be found: DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122) Bootstrap: range (1445230693272554509,1449885986247309687] exists on /10.11.12.13 DEBUG [main] 2014-08-14 18:56:45,019 RangeStreamer.java (line 122) Bootstrap: range (1445230693272554509,1449885986247309687] exists on /10.11.12.14 If you look for the various non-debug log lines for these addresses you'll see some odd behavior, but you see similar odd behavior in the logs of the node that is able to successfully bootstrap. $ grep '10.11.12.13' system.log | grep -v ^DEBUG INFO [GossipStage:1] 2014-08-14 18:56:09,870 Gossiper.java (line 910) Node /10.11.12.13 is now part of the cluster INFO [GossipStage:1] 2014-08-14 18:56:15,267 Gossiper.java (line 910) Node /10.11.12.13 is now part of the cluster INFO [HANDSHAKE-/10.11.12.13] 2014-08-14 18:56:45,571 OutboundTcpConnection.java (line 386) Handshaking version with /10.11.12.13 INFO [RequestResponseStage:5] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:2] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:7] 2014-08-14 18:56:45,580 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:1] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP INFO [RequestResponseStage:8] 2014-08-14 18:56:45,579 Gossiper.java (line 876) InetAddress /10.11.12.13 is now UP $ grep '10.11.12.14' system.log | grep -v ^DEBUG INFO [GossipStage:1] 2014-08-14 18:56:09,865 Gossiper.java (line 910) Node /10.11.12.14 is now part of the cluster INFO [GossipStage:1] 2014-08-14 18:56:15,106 Gossiper.java (line 910) Node /10.11.12.14 is now part of the cluster The general order of events, as told by JOINING lines: $ grep 'JOINING' system.log INFO [main] 2014-08-14 18:56:10,638 StorageService.java (line 1009) JOINING: waiting for ring information INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) JOINING: schema complete, ready to bootstrap INFO [main] 2014-08-14 18:56:14,640 StorageService.java (line 1009) JOINING: waiting for pending range calculation INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) JOINING: calculation complete, ready to bootstrap INFO [main] 2014-08-14 18:56:14,641 StorageService.java (line 1009) JOINING: getting bootstrap token INFO [main] 2014-08-14 18:56:14,712 StorageService.java (line 1009) JOINING: sleeping 30000 ms for pending range setup INFO [main] 2014-08-14 18:56:44,713 StorageService.java (line 1009) JOINING: Starting to bootstrap... Thanks in advance for any suggestions. Peter