Date: Wed, 4 Sep 2019 20:00:04 +0000 (UTC)
From: "Alexei Scherbakov (Jira)"
To: issues@ignite.apache.org
Subject: [jira] [Commented] (IGNITE-12133) O(log n) partition exchange

[ https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922809#comment-16922809 ]

Alexei Scherbakov commented on IGNITE-12133:
--------------------------------------------

The PME protocol itself doesn't leverage the ring: it uses direct node-to-node communication for sending partition maps (except for one special case). The ring is used by the discovery protocol, which detects topology changes and delivers the corresponding events ("node left", "node added") to grid nodes; those events are what trigger PME.

The discovery protocol also provides guaranteed ordered message delivery, which is used extensively by Ignite internals and cannot be replaced easily.

Actually, PME consists of three phases:

1. Discovery phase, with O(n) complexity for the default TcpDiscoverySpi implementation.
2. Topology unlock waiting (out of this post's scope).
3. PME phase, with k * O(m) complexity, where m is the number of I/O sender threads and k depends on topology size.

Total PME complexity is the sum of 1 and 3, so to speed up PME we should improve both.

How to improve 1?

The ring was initially designed for small topologies and still works very well for such cases with default settings. ZooKeeper-based discovery was introduced specifically for large topologies and has better complexity.

So for small topologies I suggest using the defaults. For large topologies, ZooKeeper discovery should be used.

How to improve 3?

For small topologies, same as 1: use the defaults.

For large topologies we could take [~mnk]'s proposal and use a tree-like message propagation pattern to achieve O(log n) complexity.

I agree with [~ivan.glukos] that this increases failover complexity, but I think it's doable.

NOTE: the same idea could be used to improve replicated cache performance on large topologies. We have a long-known issue with performance degradation when the topology is large.

[~Jokser]

The gossip idea looks interesting, but it seems like a complicated change and reinventing the wheel. Why not stick to ZooKeeper?
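To make the difference between phases 1 and 3 concrete, here is a minimal cost model (my own illustration, not Ignite code; class and method names are hypothetical) comparing the number of sequential hops needed to reach n nodes over a ring versus a fanout-k tree, the pattern proposed for the PME phase:

```java
// Illustrative cost model: sequential hops to notify n nodes.
public class ExchangeCost {
    // Ring: the event travels node to node, so latency grows linearly.
    static int ringHops(int n) {
        return n - 1;
    }

    // Fanout-k tree: in each hop, every already-notified node forwards
    // to k new nodes, so coverage multiplies by (k + 1) per hop and the
    // hop count grows logarithmically in n.
    static int treeHops(int n, int k) {
        int reached = 1; // coordinator already knows
        int hops = 0;
        while (reached < n) {
            reached += reached * k;
            hops++;
        }
        return hops;
    }

    public static void main(String[] args) {
        for (int n : new int[] {8, 64, 512, 4096}) {
            System.out.printf("n=%d ring=%d tree(k=2)=%d%n",
                n, ringHops(n), treeHops(n, 2));
        }
    }
}
```

At 4096 nodes the ring needs 4095 sequential hops while a fanout-2 tree needs 8, which is why the tree-like pattern only starts to matter on large topologies.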
> O(log n) partition exchange
> ---------------------------
>
>                 Key: IGNITE-12133
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12133
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Moti Nisenson-Ken
>            Priority: Major
>
> Currently, partition exchange leverages a ring. This means that communication is O(n) in the number of nodes. It also means that if non-coordinator nodes hang, it can take much longer to successfully resolve the topology.
> Instead, why not use something like a skip-list where the coordinator is first. The coordinator notifies the first node at each level of the skip-list. Each node then notifies all of its "near-neighbours" in the skip-list, where node B is a near-neighbour of node A if max-level(B) <= max-level(A), and B is the first node at its level when traversing from A in the direction of B, skipping over nodes C which have max-level(C) > max-level(A).
>
> 1
> 1 . . . 3
> 1       3 . . . 5
> 1 . 2 . 3 . 4 . 5 . 6
>
> In the above, 1 would notify 2 and 3, 3 would notify 4 and 5, 2 -> 4, 4 -> 6, and 5 -> 6.
> One can achieve better redundancy by having each node traverse in both directions, and having the coordinator also notify the last node in the list at each level. This way, in the above example, if 2 and 3 were both down, 4 would still get notified from 5 and 6 (in the backwards direction).
> The idea is that each individual node has O(log n) nodes to notify, so the overall time is reduced. Additionally, we can deal well with at least one node failure; if one includes the option of processing backwards, two consecutive node failures can be handled as well. By taking this kind of approach, the coordinator can treat any nodes it didn't receive a message from as not connected and update the topology accordingly (disconnecting any nodes that it didn't get a notification from).
> While there are some edge cases here (e.g. 2 disconnected nodes, then 1 connected node, then 2 disconnected nodes: the connected node would be wrongly ejected from the topology), these would generally be too rare to need explicit handling.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
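The near-neighbour rule quoted in the description above can be sketched in a few lines. This is a hypothetical illustration under one reading of the rule ("first at its level" meaning no non-skipped node of equal or greater level lies in between); the class and method names are mine, not Ignite's. It reproduces the example fanout 1 -> {2, 3}, 3 -> {4, 5}, 2 -> 4, 4 -> 6, 5 -> 6:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed skip-list notification rule (not Ignite code).
// maxLevel[i] is the skip-list height of node i; index 0 is the coordinator.
public class SkipListNotify {
    // Forward near-neighbours of node a: walk right, skip nodes taller
    // than a, and collect each node that is taller than every non-skipped
    // node seen so far (i.e. the first node encountered at its level).
    static List<Integer> nearNeighbours(int[] maxLevel, int a) {
        List<Integer> out = new ArrayList<>();
        int tallest = -1; // tallest non-skipped node seen so far
        for (int b = a + 1; b < maxLevel.length; b++) {
            if (maxLevel[b] > maxLevel[a])
                continue; // skip nodes taller than a
            if (maxLevel[b] > tallest) {
                out.add(b); // first node seen at this level: notify it
                tallest = maxLevel[b];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Heights from the example diagram (1-based node labels):
        // node 1 -> 3, node 3 -> 2, node 5 -> 1, nodes 2, 4, 6 -> 0.
        int[] lvl = {3, 0, 2, 0, 1, 0};
        for (int a = 0; a < lvl.length; a++) {
            List<Integer> labels = new ArrayList<>();
            for (int b : nearNeighbours(lvl, a))
                labels.add(b + 1);
            System.out.println((a + 1) + " notifies " + labels);
        }
    }
}
```

Each node notifies at most one neighbour per level, so the fan-out per node is bounded by its height, which is O(log n) in expectation for a skip-list.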