From: Ted Dunning
Date: Mon, 22 Oct 2018 19:02:12 -0700
Subject: improving tolerance to network failures
To: dev@zookeeper.apache.org

I am starting work on a project to improve the tolerance of Zookeeper to network failures and would like feedback on the idea.

The problem is that in environments where link bonding is forbidden (they exist, trust me), Zookeeper is sensitive to the loss of a single switch or a few network links. This applies to both client and server.

Upon examination of the problem, I think that this could be mitigated by changing the logic that opens connections between servers so that it tries one of several addresses for each server. This should be a small change. I think that dynamic reconfiguration should be fine with this as well.

On the client side, the situation is simpler: we can provide, either by configuration or from the server cluster, a list of all possible addresses, and the client's current connection logic should work fine.

One worry I have has to do with certificates on secure connections, but it seems that multiple certs would do the trick.

I have started a collaborative document to work on the design approach. Once that is judged by the community to be sufficiently mature, I will move it to a JIRA. That document is at

https://docs.google.com/document/d/1iGVwxeHp57qogwfdodCh9b32P2_kOQaJZ2GDo7j36fI/edit?usp=sharing

The design document is currently open to the world for commenting so that anybody can suggest changes or ask questions. I will act as a bit of a moderator so that the document can remain completely open.
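
To make the server-side idea above a little more concrete, here is a rough sketch of what "try one of several options" might look like when opening a connection to a peer. It is only an illustration under assumptions: the class name MultiAddressConnector, the method connectToAny, and the idea of passing an ordered list of candidate addresses are all hypothetical and are not taken from the ZooKeeper code base or from the design document.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

/** Hypothetical helper for illustration only; not part of ZooKeeper. */
final class MultiAddressConnector {
    private MultiAddressConnector() {}

    /**
     * Tries each candidate address for a single peer in order and returns
     * the first socket that connects. Assumes every candidate refers to the
     * same server, reachable over a different network path.
     */
    static Socket connectToAny(List<InetSocketAddress> candidates, int timeoutMs)
            throws IOException {
        IOException lastFailure = null;
        for (InetSocketAddress address : candidates) {
            Socket socket = new Socket();
            try {
                // Bounded wait so one dead link cannot stall the whole attempt.
                socket.connect(address, timeoutMs);
                return socket;
            } catch (IOException e) {
                lastFailure = e;
                try {
                    socket.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if closing a failed socket fails.
                }
            }
        }
        // Every candidate failed (or the list was empty).
        throw new IOException("no candidate address was reachable: " + candidates, lastFailure);
    }
}
```

A caller would pass the addresses configured for one peer, for example one per network interface, and use the returned socket exactly as it would a socket opened against a single address. The sketch tries addresses sequentially; attempting them in parallel would fail over faster at the cost of extra connections, and which is preferable is a question for the design discussion.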