ignite-issues mailing list archives

From "Maxim Muzafarov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins
Date Sat, 14 Jul 2018 19:55:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541426#comment-16541426
] 

Maxim Muzafarov edited comment on IGNITE-7165 at 7/14/18 7:54 PM:
------------------------------------------------------------------

h5. Changes ready
 * TC: [#2722 (14 Jul 18 19:46)|https://ci.ignite.apache.org/viewLog.html?buildId=1497012&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll]
 * PR: [#4097|https://github.com/apache/ignite/pull/4097]
 * Upsource: [IGNT-CR-670|https://reviews.ignite.apache.org/ignite/review/IGNT-CR-670]

h5. Implementation details
 # _Keep topology version to demand (now it's not the last topology version)_
 To calculate the affinity assignment difference against the last topology version, we should
save the version on which rebalance is currently running. Updating this version from the exchange
thread after PME keeps us away from unnecessary processing of stale supply messages (a simplified
sketch follows this list).
 # _{{RebalanceFuture.demanded}} to process cache groups independently_
 Starting the rebalance process for cache groups forms a long chain built by the {{addAssignments}}
method (e.g. {{ignite-sys-cache -> cacheR -> cacheR3 -> cacheR2}}). If rebalance has
started but the initial demand message for some groups has not been sent yet (e.g. because
cleaning/evicting of the previous groups takes a long time), the chain can easily be cancelled
and a new rebalance future started.
 # _REPLICATED cache processing_
 The affinity assignment for this cache type never changes, so we do not need to stop rebalance
for such a cache each time a new topology version arrives. Rebalance should run only once,
except when a node from which partitions of this group are being demanded leaves ({{LEFT}})
or fails ({{FAIL}}) the cluster.
 # _EMPTY assignments handling_
 Whenever the {{generateAssignments}} method determines there is no difference with the current
topology version (returns an empty map), no matter how affinity changed, we should return a
successful result as fast as possible (sketched after this list).
 # _Pending exchanges handling (cancelled assignments)_
 The exchange thread can have pending exchanges in its queue (the {{hasPendingExchanges}} method).
If such pending exchanges exist, starting a new rebalance routine is pointless and we should
skip rebalance. In our case these pending exchanges cause no partition changes in the affinity
assignments, which is why we do not need to cancel the current rebalance future.
 # _RENTING/EVICTING partitions after PME_
 PME prepares partitions to be {{RENTED}} or {{EVICTED}} if they are not assigned to the local
node according to the new affinity calculation. Processing a stale supply message (for a previous
version) can lead to exceptions when getting partitions in an incorrect state on the local node.
That is why stale {{GridDhtPartitionSupplyMessage}}s must be ignored by the {{Demander}}.
 # _Supply context map clearing changed_
 Previously, the supply context map was cleared on every topology version change. Since we can
now perform rebalance not on the latest topology version, this behavior has to change: the
context is cleared only for nodes that left or failed the topology.
 # _{{LEFT}} or {{FAIL}} nodes from cluster (rebalance restart)_
 If the rebalance future demands partitions from nodes which have left the cluster, rebalance
must be restarted (sketched after this list).
 # _OWNING → MOVING on coordinator due to obsolete partition update counter_
 The affinity assignment can be unchanged while rebalance is currently running. The coordinator
performs PME and, after merging all SingleMessages, marks partitions with an obsolete update
sequence to be demanded from remote nodes (by changing the partition state OWNING -> MOVING).
We should schedule a new rebalance in this case (sketched after this list).
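
A minimal, self-contained sketch of the stale supply message check from items 1 and 6, assuming
simplified {{TopologyVersion}}, {{SupplyMessage}} and {{Demander}} classes; these are illustrative
stand-ins, not the real {{GridDhtPartitionDemander}} internals.
{code:java}
/** Illustrative (major, minor) topology version, comparable like an affinity topology version. */
final class TopologyVersion implements Comparable<TopologyVersion> {
    final long major;
    final int minor;

    TopologyVersion(long major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    @Override public int compareTo(TopologyVersion other) {
        int cmp = Long.compare(major, other.major);
        return cmp != 0 ? cmp : Integer.compare(minor, other.minor);
    }
}

/** Illustrative supply message carrying the topology version it was generated for. */
final class SupplyMessage {
    final TopologyVersion topVer;

    SupplyMessage(TopologyVersion topVer) {
        this.topVer = topVer;
    }
}

final class Demander {
    /** Item 1: version the current rebalance was started on, updated from the exchange thread after PME. */
    private volatile TopologyVersion rebalanceTopVer;

    void onRebalanceStarted(TopologyVersion topVer) {
        rebalanceTopVer = topVer;
    }

    /** Item 6: stale messages are dropped, otherwise we may touch partitions PME already moved to RENTING/EVICTED. */
    void handleSupplyMessage(SupplyMessage msg) {
        if (rebalanceTopVer == null || msg.topVer.compareTo(rebalanceTopVer) < 0)
            return; // Stale supply message for an older topology version - ignore it.

        // ... apply supplied entries to local MOVING partitions ...
    }
}
{code}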
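
The scheduling checks from items 4 and 5 can be sketched as below; {{RebalanceScheduler}},
{{ExchangeQueue}} and {{startRoutine}} are hypothetical names used only for illustration, not
the actual preloader API.
{code:java}
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

final class RebalanceScheduler {
    /** Illustrative view of the exchange worker queue. */
    interface ExchangeQueue {
        boolean hasPendingExchanges();
    }

    private final ExchangeQueue exchangeQueue;

    RebalanceScheduler(ExchangeQueue exchangeQueue) {
        this.exchangeQueue = exchangeQueue;
    }

    /** @param assignments Supplier node -> partitions to demand, as produced by assignment generation. */
    CompletableFuture<Boolean> schedule(Map<UUID, Set<Integer>> assignments) {
        // Item 4: no difference with the current topology version - report success immediately.
        if (assignments.isEmpty())
            return CompletableFuture.completedFuture(true);

        // Item 5: another exchange is already queued and will reschedule rebalance itself,
        // so starting a routine now would only get cancelled - skip it, keep the current future.
        if (exchangeQueue.hasPendingExchanges())
            return CompletableFuture.completedFuture(false);

        return startRoutine(assignments);
    }

    private CompletableFuture<Boolean> startRoutine(Map<UUID, Set<Integer>> assignments) {
        // ... send initial demand messages; the future completes when the group is rebalanced ...
        return new CompletableFuture<>();
    }
}
{code}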
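
Items 3 and 8 boil down to one restart rule, sketched below with a simplified stand-in for the
rebalance future (not the actual Ignite class): a running rebalance is restarted only when one of
its supplier nodes left or failed, and a topology change alone never restarts a REPLICATED group.
{code:java}
import java.util.Set;
import java.util.UUID;

final class RebalanceFuture {
    private final boolean replicated;

    /** Nodes the current rebalance demands partitions from. */
    private final Set<UUID> supplierNodes;

    private volatile boolean done;

    RebalanceFuture(boolean replicated, Set<UUID> supplierNodes) {
        this.replicated = replicated;
        this.supplierNodes = supplierNodes;
    }

    /** Item 8: called on NODE_LEFT / NODE_FAILED - a lost supplier means its partitions never arrive. */
    boolean needRestartOnNodeLeft(UUID leftNodeId) {
        return !done && supplierNodes.contains(leftNodeId);
    }

    /** Item 3: REPLICATED affinity never changes, so a new topology version alone does not restart it. */
    boolean needRestartOnTopologyChange() {
        return !done && !replicated;
    }
}
{code}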
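
Finally, a rough sketch of the item 9 check on the coordinator, with illustrative structures: after
merging the update counters collected from all {{SingleMessages}}, partitions whose local counter
is behind the maximum are switched to {{MOVING}} and a new rebalance is scheduled.
{code:java}
import java.util.Map;

enum PartitionState { OWNING, MOVING }

/** Illustrative local partition with an update counter. */
final class Partition {
    final int id;
    long updateCounter;
    PartitionState state = PartitionState.OWNING;

    Partition(int id, long updateCounter) {
        this.id = id;
        this.updateCounter = updateCounter;
    }
}

final class Coordinator {
    /** @param maxCounters Highest update counter per partition merged from all SingleMessages. */
    boolean markObsoletePartitions(Iterable<Partition> localParts, Map<Integer, Long> maxCounters) {
        boolean scheduleRebalance = false;

        for (Partition p : localParts) {
            Long max = maxCounters.get(p.id);

            if (max != null && p.updateCounter < max) {
                p.state = PartitionState.MOVING; // Partition is behind - demand it again.
                scheduleRebalance = true;        // Item 9: schedule a new rebalance.
            }
        }

        return scheduleRebalance;
    }
}
{code}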


was (Author: mmuzaf):
h5. Changes ready
 * TC: [#2636 (11 Jul 18 21:20)|https://ci.ignite.apache.org/viewLog.html?buildId=1479780&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll]
 * PR: [#4097|https://github.com/apache/ignite/pull/4097]
 * Upsource: [IGNT-CR-670|https://reviews.ignite.apache.org/ignite/review/IGNT-CR-670]


> Re-balancing is cancelled if client node joins
> ----------------------------------------------
>
>                 Key: IGNITE-7165
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7165
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Cherkasov
>            Assignee: Maxim Muzafarov
>            Priority: Critical
>              Labels: rebalance
>             Fix For: 2.7
>
>
> Re-balancing is cancelled if a client node joins. Re-balancing can take hours, and each time
a client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager] Added
new node to topology: TcpDiscoveryNode [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1,
127.0.0.1, 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, /172.31.16.213:0],
discPort=0, order=36, intOrder=24, lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe,
isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager] Topology
snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef,
customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], resVer=AffinityTopologyVersion
[topVer=36, minorTopVer=0], err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=36, minorTopVer=0],
evt=NODE_JOINED, node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Cancelled
rebalancing from all nodes [topology=AffinityTopologyVersion [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
Rebalancing started [top=null, evt=NODE_JOINED, node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting
rebalancing [mode=ASYNC, fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18,
topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting
rebalancing [mode=ASYNC, fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15,
topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting
rebalancing [mode=ASYNC, fromNode=b3a8be53-e61f-4023-a906-a265923837ba, partitionsCount=15,
topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting
rebalancing [mode=ASYNC, fromNode=f825cb4e-7dcc-405f-a40d-c1dc1a3ade5a, partitionsCount=12,
topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting
rebalancing [mode=ASYNC, fromNode=4ae1db91-8b88-4180-a84b-127a303959e9, partitionsCount=11,
topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting
rebalancing [mode=ASYNC, fromNode=7c286481-7638-49e4-8c68-fa6aa65d8b76, partitionsCount=18,
topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> So in clusters with a large amount of data and frequent client leave/join events, this
means that a new server will never receive its partitions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
