ignite-issues mailing list archives

From "Semen Boikov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-3212) Servers get stuck with the warning "Failed to wait for initial partition map exchange" during failover test
Date Thu, 02 Jun 2016 09:53:59 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312057#comment-15312057 ]

Semen Boikov commented on IGNITE-3212:
--------------------------------------

Another issue: I observed this stack trace in a thread dump at the moment the node logged the 'Failed to wait for partition release future' message:

{noformat}
Thread [name="disco-event-worker-#78%null%", id=94, state=RUNNABLE, blockCnt=6, waitCnt=15904]
        at o.a.i.i.processors.cache.transactions.IgniteTxManager.txsPreparedOrCommitted(IgniteTxManager.java:1830)
        at o.a.i.i.processors.cache.transactions.IgniteTxManager.txsPreparedOrCommitted(IgniteTxManager.java:1638)
        at o.a.i.i.processors.cache.distributed.GridCacheTxRecoveryFuture.prepare(GridCacheTxRecoveryFuture.java:189)
        at o.a.i.i.processors.cache.transactions.IgniteTxManager.commitIfPrepared(IgniteTxManager.java:1892)
        at o.a.i.i.processors.cache.distributed.GridCacheTxRecoveryFuture$MiniFuture.onNodeLeft(GridCacheTxRecoveryFuture.java:524)
        at o.a.i.i.processors.cache.distributed.GridCacheTxRecoveryFuture$MiniFuture.access$200(GridCacheTxRecoveryFuture.java:475)
        at o.a.i.i.processors.cache.distributed.GridCacheTxRecoveryFuture.onNodeLeft(GridCacheTxRecoveryFuture.java:404)
        at o.a.i.i.processors.cache.GridCacheMvccManager$3.onEvent(GridCacheMvccManager.java:253)
        at o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:770)
        at o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:755)
        at o.a.i.i.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:295)
        at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2078)
        at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:2285)
        at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2118)
        at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:110)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

At the very least this execution should be moved out of the 'disco-event-worker' thread (the idea is sketched below). Also, 'txsPreparedOrCommitted' iterates over 'IgniteTxManager.completedVersHashMap'; we need to check how long that iteration takes when completedVersHashMap is at its maximum allowed size (262144). It should probably be optimized as well, since with multiple pending GridCacheTxRecoveryFutures each one repeats the full scan.
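
A minimal sketch of the first idea, assuming a dedicated single-threaded executor; 'recoveryExecutor' and 'onNodeLeft0' are illustrative names, not the actual GridCacheMvccManager wiring:

{noformat}
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class RecoveryOffloadSketch {
    // Hypothetical: a single-threaded pool keeps per-event ordering while
    // freeing disco-event-worker from the potentially long recovery scan.
    private final ExecutorService recoveryExecutor =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "tx-recovery-worker"));

    /** Called from the discovery listener instead of running recovery inline. */
    void onNodeLeft(UUID nodeId) {
        // Submit and return immediately: the walk over completedVersHashMap
        // no longer blocks discovery event processing.
        recoveryExecutor.submit(() -> onNodeLeft0(nodeId));
    }

    private void onNodeLeft0(UUID nodeId) {
        // ... here the GridCacheTxRecoveryFuture.onNodeLeft() /
        // txsPreparedOrCommitted() chain from the stack trace above would run.
    }
}
{noformat}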
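
To put a first number on the iteration cost, a rough stand-in measurement, assuming a plain ConcurrentHashMap of 262144 Boolean entries in place of completedVersHashMap (not a rigorous benchmark):

{noformat}
import java.util.concurrent.ConcurrentHashMap;

public class CompletedVersScanTiming {
    public static void main(String[] args) {
        // Stand-in for IgniteTxManager.completedVersHashMap at its max size.
        ConcurrentHashMap<Long, Boolean> completed = new ConcurrentHashMap<>();
        for (long i = 0; i < 262_144; i++)
            completed.put(i, Boolean.TRUE);

        // Time one full scan, which is roughly what txsPreparedOrCommitted()
        // does per recovery future; multiply by the number of concurrent
        // GridCacheTxRecoveryFutures to estimate the disco-event-worker stall.
        long start = System.nanoTime();
        long hits = 0;
        for (Boolean committed : completed.values())
            if (committed) hits++;
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        System.out.println("Scanned " + hits + " entries in " + elapsedMicros + " us");
    }
}
{noformat}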

> Servers get stuck with the warning "Failed to wait for initial partition map exchange" during failover test
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-3212
>                 URL: https://issues.apache.org/jira/browse/IGNITE-3212
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Ksenia Rybakova
>            Assignee: Semen Boikov
>             Fix For: 1.7
>
>
> Servers being restarted during the failover test get stuck after some time with the warning "Failed to wait for initial partition map exchange".
> {noformat}
> [08:44:41,303][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=db557f04-43b7-4e28-ae0d-d4dcf4139c89, addrs=[10.20.0.222, 127.0.0.1], sockAddrs=[fosters-222/10.20.0.222:47503, /10.20.0.222:47503, /127.0.0.1:47503], discPort=47503, order=44, intOrder=32, lastExchangeTime=1464363880917, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
> [08:44:41,304][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Topology snapshot [ver=44, servers=19, clients=1, CPUs=64, heap=160.0GB]
> [08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=6fae61a7-c1c1-40e5-8ad0-8bf5d6c86eb7, addrs=[10.20.0.223, 127.0.0.1], sockAddrs=[fosters-223/10.20.0.223:47503, /10.20.0.223:47503, /127.0.0.1:47503], discPort=47503, order=45, intOrder=33, lastExchangeTime=1464363910999, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
> [08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Topology snapshot [ver=45, servers=20, clients=1, CPUs=64, heap=170.0GB]
> [08:45:19,942][INFO ][ignite-update-notifier-timer][GridUpdateNotifier] Update status is not available.
> [08:46:20,370][WARN ][main][GridCachePartitionExchangeManager] Failed to wait for initial partition map exchange. Possible reasons are:
>   ^-- Transactions in deadlock.
>   ^-- Long running transactions (ignore if this is the case).
>   ^-- Unreleased explicit locks.
> [08:48:30,375][WARN ][main][GridCachePartitionExchangeManager] Still waiting for initial partition map exchange ...
> {noformat}
> "Failed to wait for partition release future" warnings are on other nodes.
> {noformat}
> [08:09:45,822][WARN ][exchange-worker-#82%null%][GridDhtPartitionsExchangeFuture] Failed to wait for partition release future [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], node=cab5d0e0-7365-4774-8f99-d9f131c5d896]. Dumping pending objects that might be the cause:
> [08:09:45,822][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Ready affinity version: AffinityTopologyVersion [topVer=28, minorTopVer=1]
> [08:09:45,826][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Last exchange future: GridDhtPartitionsExchangeFuture ...
> {noformat}
> Load config:
> - 1 client, 20 servers (5 servers per 1 host)
> - warmup 60
> - duration 66h
> - preload 5M
> - key range 10M
> - operations: PUT PUT_ALL GET GET_ALL INVOKE INVOKE_ALL REMOVE REMOVE_ALL PUT_IF_ABSENT REPLACE
> - backups count 3
> - 3 servers restart every 15 min with a 30 sec step, pause between stop and start: 5 min



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
