ignite-issues mailing list archives

From "Moti Nisenson-Ken (Jira)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-12133) O(log n) partition exchange
Date Sun, 27 Oct 2019 08:27:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960532#comment-16960532 ]

Moti Nisenson-Ken commented on IGNITE-12133:
--------------------------------------------

After looking a bit deeper into it, I think that [~ivan.glukos] is on the right track. It
would make most sense to implement this first as a general mechanism for broadcasting a message
to a set of nodes, before applying it to discovery. That way it could be leveraged wherever
it makes sense, and once it is stable a decision could be made about whether to use it as
part of discovery.

In this case, we'd talk about the initiator (instead of the coordinator), and the skip list
would then start from the initiator. Since it's general, there should be efficient ways
to support common cases from a message format perspective. Common cases would likely include:
the coordinator as the initiator, the set of all nodes, the set of nodes holding data for a
given cache, or all baseline topology nodes. For a random subset of nodes, the message would
of course need to include the list of nodes to be contacted.
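To make the point about common cases concrete, here is a minimal sketch of a compact target-set encoding for such a broadcast message. Every name in it (`BroadcastTarget`, `Kind`, `wireSizeHint`) is hypothetical and not an Ignite API; the only idea being illustrated is that the common cases need no explicit node list on the wire.

```java
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch, not Ignite code: a broadcast message's target set can
// be a small tag for the common cases, with an explicit id list only for
// arbitrary subsets of nodes.
public class BroadcastTarget {
    enum Kind { ALL_NODES, CACHE_DATA_NODES, BASELINE_NODES, EXPLICIT_LIST }

    final Kind kind;
    final String cacheName;       // set only for CACHE_DATA_NODES
    final List<UUID> explicitIds; // set only for EXPLICIT_LIST (random subsets)

    private BroadcastTarget(Kind kind, String cacheName, List<UUID> ids) {
        this.kind = kind;
        this.cacheName = cacheName;
        this.explicitIds = ids;
    }

    static BroadcastTarget allNodes() {
        return new BroadcastTarget(Kind.ALL_NODES, null, Collections.emptyList());
    }

    static BroadcastTarget cacheDataNodes(String cacheName) {
        return new BroadcastTarget(Kind.CACHE_DATA_NODES, cacheName, Collections.emptyList());
    }

    static BroadcastTarget explicit(List<UUID> ids) {
        return new BroadcastTarget(Kind.EXPLICIT_LIST, null, ids);
    }

    // Only the explicit-list case pays for node ids in the message body.
    int wireSizeHint() {
        return kind == Kind.EXPLICIT_LIST ? explicitIds.size() : 0;
    }
}
```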

> O(log n) partition exchange
> ---------------------------
>
>                 Key: IGNITE-12133
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12133
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Moti Nisenson-Ken
>            Priority: Major
>
> Currently, partition exchange leverages a ring. This means that communication is O(n)
in the number of nodes. It also means that if non-coordinator nodes hang, it can take much
longer to successfully resolve the topology.
> Instead, why not use something like a skip list where the coordinator is first? The coordinator
can notify the first node at each level of the skip list. Each node then notifies all of its
"near-neighbours" in the skip list, where node B is a near-neighbour of node A if max-level(B)
<= max-level(A), and B is the first node at its level when traversing from A in the direction
of B, skipping over nodes C which have max-level(C) > max-level(A).
> 1
> 1 . . . 3
> 1 . . . 3 . . . 5
> 1 . 2 . 3 . 4 . 5 . 6
> In the above, 1 would notify 2 and 3; 3 would notify 4 and 5; 2 -> 4; 4 -> 6; and
5 -> 6.
> One can achieve better redundancy by having each node traverse in both directions, and by
having the coordinator also notify the last node at each level of the list. That way, in the
above example, if 2 and 3 were both down, 4 would still get notified by 5 and 6 (in the
backwards direction).
>  
> The idea is that each individual node has O(log n) nodes to notify, so the overall time
is reduced. Additionally, any single node failure can be tolerated, and if one includes
the option of processing backwards, two consecutive node failures can be handled as well.
With this kind of approach, the coordinator can treat any nodes it didn't receive a message
from as not connected and update the topology accordingly (disconnecting any nodes it didn't
get a notification from). While there are some edge cases here (e.g. two disconnected nodes,
then one connected node, then two disconnected nodes: the connected node would be wrongly
ejected from the topology), these should generally be too rare to need explicit handling.
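The near-neighbour rule above can be checked with a small standalone sketch (plain Java, not Ignite code). Node heights follow the four-level diagram in the description: node 1 has height 4, node 3 has height 3, node 5 has height 2, and nodes 2, 4, 6 have height 1.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Standalone sketch of the near-neighbour rule from the issue description.
// height[i] is the skip-list level of node i (1 = present only in the bottom list).
public class SkipListFanout {
    // Nodes that 'a' notifies when traversing in direction dir (+1 forward, -1 backward):
    // for each level up to height[a], the first node in that level's list,
    // skipping over nodes taller than 'a'.
    static List<Integer> nearNeighbours(int a, int[] height, int dir) {
        Set<Integer> out = new LinkedHashSet<>();
        for (int lvl = 1; lvl <= height[a]; lvl++) {
            for (int b = a + dir; b >= 1 && b < height.length; b += dir) {
                if (height[b] > height[a])
                    continue;               // skip over taller nodes
                if (height[b] >= lvl) {     // first member of the level-lvl list
                    out.add(b);
                    break;
                }
            }
        }
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        int[] height = {0, 4, 1, 3, 1, 2, 1}; // index 0 unused; nodes 1..6
        System.out.println(nearNeighbours(1, height, +1)); // [2, 3]
        System.out.println(nearNeighbours(3, height, +1)); // [4, 5]
        System.out.println(nearNeighbours(2, height, +1)); // [4]
        // Backwards pass: 6 reaches 4 even if 2 and 3 are down.
        System.out.println(nearNeighbours(6, height, -1)); // [4]
    }
}
```

Each node's fan-out here is bounded by its height, i.e. O(log n) in expectation, which matches the claimed overall notification time.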



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
