Subject: Re: Spread SolrCloud across two locations
To: solr-user@lucene.apache.org
From: Shawn Heisey
Date: Wed, 24 May 2017 17:01:48 -0600

On 5/24/2017 4:14 PM, Jan Høydahl wrote:
> Sure, ZK does by design not support a two-node/two-location setup. But still, users may want/need to deploy that,
> and my question was if there are smart ways to make such a setup as little painful as possible in case of failure.
>
> Take the example of DC1: 3xZK and DC2: 2xZK again. And then DC1 goes BOOM.
> Without an active action DC2 would be read-only
> What if then the Ops personnel in DC2 could, with a single script/command, instruct DC2 to resume “master” role:
> - Add a 3rd DC2 ZK to the two existing, reconfigure and let them sync up.
> - Rolling restart of Solr nodes with new ZK_HOST string
> Of course, they would also then need to make sure that DC1 does not boot up again before compatible change has been done there too.

When ZK 3.5 comes out and SolrCloud is updated to use it, I think it
might be possible to remove the dc1 servers from the ensemble and add
another server in dc2 to re-form a new quorum, without restarting
anything.  Based on past release history, it could be quite some time
before a stable 3.5 is available; the ZK project doesn't release
anywhere near as often as Lucene/Solr does.

With the current ZK version, I think your procedure would work, but I
definitely wouldn't call it painless.  Indexing would be unavailable
while dc1 is down, and everything could be down while the restarts are
happening.

Whether ZK 3.5 is there or not, there is potential for unknown behavior
when dc1 comes back online, unless you can have dc1 personnel shut the
servers down, or block communication between your servers in dc1 and dc2.

Overall, having one or two ZK servers in each main DC and a tiebreaker
ZK on a low-cost server in a third DC seems like a better option.
There's no intervention required when a DC goes down, or when it comes
back up.

Thanks,
Shawn
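
[Editor's sketch] The tiebreaker layout recommended above can be illustrated with a zoo.cfg fragment.  This is a minimal sketch under assumed conditions: hostnames are hypothetical, and it shows a five-voter ensemble with two ZK servers in each main data center plus one tiebreaker in a third location.  The same server list would go on every node:

```
# zoo.cfg fragment (hypothetical hostnames)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# Two voters in each main data center
server.1=zk1.dc1.example.com:2888:3888
server.2=zk2.dc1.example.com:2888:3888
server.3=zk1.dc2.example.com:2888:3888
server.4=zk2.dc2.example.com:2888:3888
# Tiebreaker on a low-cost host in a third location
server.5=zk-tiebreak.dc3.example.com:2888:3888
```

With five voters, losing either main DC still leaves three servers reachable, which is a majority, so the ensemble keeps quorum and the surviving Solr nodes stay writable without any manual failover.  The Solr nodes' ZK_HOST string would list all five servers.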