ignite-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maxim Muzafarov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (IGNITE-7196) Exchange can stuck and wait while new node restoring state from disk and starting caches
Date Wed, 26 Sep 2018 08:33:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628412#comment-16628412
] 

Maxim Muzafarov edited comment on IGNITE-7196 at 9/26/18 8:32 AM:
------------------------------------------------------------------

[~dpavlov], [~DmitriyGovorukhin],

In general, I don't think that it's possible to get clear TC Bot visa on such PRs. I've noticed
that if TC agents are under the pressure we will get failure suites more often.

But, test {{IgnitePdsDestroyCacheTest.testDestroyCachesAbruptly}} *definitely failed due to
my changes*. It's flaky - [the next PDS (Direct IO) 1|https://ci.ignite.apache.org/viewLog.html?buildId=1942756]
is OK here. Probable cause – we remove cache configuration from the store before writing
checkpoint end marker, so on binary restore, we will get an exception with missing cache descriptor
and it's not happened at the master branch due to getting cache descriptors after joining
node to the cluster. 

I need some time to investigate the original cause.

[PDS (Direct IO) 1 - #3660 (26 Sep 18 09:37)|https://ci.ignite.apache.org/viewLog.html?buildId=1945365&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_PdsDirectIo1]
-- all tests passed


was (Author: mmuzaf):
[~dpavlov], [~DmitriyGovorukhin],

In general, I don't think that it's possible to get clear TC Bot visa on such PRs. I've noticed
that if TC agents are under the pressure we will get failure suites more often.

But, test {{IgnitePdsDestroyCacheTest.testDestroyCachesAbruptly}} *definitely failed due to
my changes*. It's flaky - [the next PDS (Direct IO) 1|https://ci.ignite.apache.org/viewLog.html?buildId=1942756]
is OK here. Probable cause – we remove cache configuration from the store before writing
checkpoint end marker, so on binary restore, we will get an exception with missing cache descriptor
and it's not happened at the master branch due to getting cache descriptors after joining
node to the cluster. 
I need some time to investigate the original cause.

> Exchange can stuck and wait while new node restoring state from disk and starting caches
> ----------------------------------------------------------------------------------------
>
>                 Key: IGNITE-7196
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7196
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.3
>            Reporter: Mikhail Cherkasov
>            Assignee: Maxim Muzafarov
>            Priority: Critical
>             Fix For: 2.7
>
>
> Exchange can stuck and wait while new node restoring state from disk and starting caches,
there's a log snippet from a just joined new node that shows the issue:
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], crd=false, evt=NODE_JOINED, evtNode=3ac1160e-0de4-41bc-a366-59292c9f03c1,
customEvt=null, allowMerge=true]
> [21:36:13,023][INFO][exchange-worker-#62%statement_grid%][FilePageStoreManager] Resolved
page store work directory: /mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] Resolved
write ahead log work directory: /mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,024][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] Resolved
write ahead log archive directory: /mnt/wal/WAL_archive/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463
> [21:36:13,046][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] Started
write-ahead log manager [mode=DEFAULT]
> [21:36:13,065][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] Started page
memory [memoryAllocated=100.0 MiB, pages=6352, tableSize=373.4 KiB, checkpointBuffer=100.0
MiB]
> [21:36:13,105][INFO][exchange-worker-#62%statement_grid%][PageMemoryImpl] Started page
memory [memoryAllocated=32.0 GiB, pages=2083376, tableSize=119.6 MiB, checkpointBuffer=896.0
MiB]
> [21:36:13,428][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Read checkpoint status [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930869357-1c24b6dc-d64c-4b83-8166-11edf1bfdad3-END.bin]
> [21:36:13,429][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Checking memory state [lastValidPos=FileWALPointer [idx=3582, fileOffset=59186076, len=9229,
forceFlush=false], lastMarked=FileWALPointer [idx=3629, fileOffset=50829700, len=9229, forceFlush=false],
lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:36:13,429][WARNING][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Ignite node stopped in the middle of checkpoint. Will restore memory state and finish checkpoint
on node start.
> [21:36:18,312][INFO][grid-nio-worker-tcp-comm-0-#41%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.17.115:57148]
> [21:36:21,619][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Found last checkpoint marker [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, pos=FileWALPointer
[idx=3629, fileOffset=50829700, len=9229, forceFlush=false]]
> [21:36:21,620][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Finished applying memory changes [changesApplied=165103, time=8189ms]
> [21:36:22,403][INFO][grid-nio-worker-tcp-comm-1-#42%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.28.10:47964]
> [21:36:23,414][INFO][grid-nio-worker-tcp-comm-2-#43%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.27.101:46000]
> [21:36:33,019][WARNING][main][GridCachePartitionExchangeManager] Failed to wait for initial
partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> [21:36:53,021][WARNING][main][GridCachePartitionExchangeManager] Still waiting for initial
partition map exchange [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931012268, loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe,
isClient=false], topVer=57, nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 172.31.27.101], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500],
discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe,
isClient=false], exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57,
minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500,
/127.0.0.1:48500, ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], discPort=48500,
order=57, intOrder=36, lastExchangeTime=1512931012268, loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe,
isClient=false], topVer=57, nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter [ignoreInterrupts=false,
state=INIT, res=null, hash=820635914], init=false, lastVer=null, partReleaseFut=null, exchActions=null,
affChangeMsg=null, initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null, done=false,
state=SRV, evtLatch=0, remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68, ee15ea85-f5f6-4513-871e-7b4b080911cf,
56c97317-26cf-43d2-bf76-0cab59c6fa5f, f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164,
979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef], super=GridFutureAdapter
[ignoreInterrupts=false, state=INIT, res=null, hash=542833742]]]
> [21:37:12,866][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=306c0895-1f5f-4237-bebf-8bf2b49682af, pages=157778, markPos=FileWALPointer
[idx=3629, fileOffset=50829700, len=9229, forceFlush=false], pagesWrite=5460ms, fsync=45781ms,
total=51241ms]
> [21:37:12,878][INFO][exchange-worker-#62%statement_grid%][FileWriteAheadLogManager] Resuming
logging to WAL segment [file=/mnt/wal/WAL/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/0000000000000009.wal,
offset=50838929, ver=1]
> [21:37:13,028][WARNING][main][GridCachePartitionExchangeManager] Still waiting for initial
partition map exchange [fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500],
discPort=48500, order=57, intOrder=36, lastExchangeTime=1512931032324, loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe,
isClient=false], topVer=57, nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
crd=TcpDiscoveryNode [id=56c97317-26cf-43d2-bf76-0cab59c6fa5f, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 172.31.27.101], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500, /127.0.0.1:48500, ip-172-31-27-101.eu-central-1.compute.internal/172.31.27.101:48500],
discPort=48500, order=1, intOrder=1, lastExchangeTime=1512930972830, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe,
isClient=false], exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=57,
minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=3ac1160e-0de4-41bc-a366-59292c9f03c1,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.20.209], sockAddrs=[/0:0:0:0:0:0:0:1%lo:48500,
/127.0.0.1:48500, ip-172-31-20-209.eu-central-1.compute.internal/172.31.20.209:48500], discPort=48500,
order=57, intOrder=36, lastExchangeTime=1512931032324, loc=true, ver=2.3.1#20171129-sha1:4b1ec0fe,
isClient=false], topVer=57, nodeId8=3ac1160e, msg=null, type=NODE_JOINED, tstamp=1512930972992],
nodeId=3ac1160e, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter [ignoreInterrupts=false,
state=INIT, res=null, hash=820635914], init=false, lastVer=null, partReleaseFut=null, exchActions=null,
affChangeMsg=null, initTs=1512930973012, centralizedAff=false, changeGlobalStateE=null, done=false,
state=SRV, evtLatch=0, remaining=[eac777d5-281b-4a2e-a743-c585a08bfc68, ee15ea85-f5f6-4513-871e-7b4b080911cf,
56c97317-26cf-43d2-bf76-0cab59c6fa5f, f87821f1-e4d9-49b9-9dab-39ee6a620b87, 853066b2-fb70-475e-8a77-b41f180df164,
979e0b69-05db-4652-98f2-a59c8f56acf8, 11c1f792-92cb-44a5-a481-1aa8cd257aef], super=GridFutureAdapter
[ignoreInterrupts=false, state=INIT, res=null, hash=542833742]]]
> [21:37:13,116][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor] Started
cache [name=ignite-sys-cache, memoryPolicyName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL]
> [21:37:13,125][INFO][exchange-worker-#62%statement_grid%][GridCacheProcessor] Started
cache [name=statementp, memoryPolicyName=Default_Region, mode=PARTITIONED, atomicity=TRANSACTIONAL]
> [21:37:13,131][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=57,
minorTopVer=0], waitTime=0ms, futInfo=NA]
> [21:37:13,133][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Read checkpoint status [startMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-START.bin,
endMarker=/mnt/store/node00-d1eb270c-d2cc-4550-87aa-64f6df2a9463/cp/1512930965253-306c0895-1f5f-4237-bebf-8bf2b49682af-END.bin]
> [21:37:13,134][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Applying lost cache updates since last checkpoint record [lastMarked=FileWALPointer [idx=3629,
fileOffset=50829700, len=9229, forceFlush=false], lastCheckpointId=306c0895-1f5f-4237-bebf-8bf2b49682af]
> [21:37:13,738][INFO][exchange-worker-#62%statement_grid%][GridCacheDatabaseSharedManager]
Finished applying WAL changes [updatesApplied=0, time=607ms]
> [21:37:13,749][INFO][exchange-worker-#62%statement_grid%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], crd=false]
> [21:37:13,782][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture] Received
full message, will finish exchange [node=56c97317-26cf-43d2-bf76-0cab59c6fa5f, resVer=AffinityTopologyVersion
[topVer=57, minorTopVer=0]]
> [21:37:13,796][INFO][grid-nio-worker-tcp-comm-4-#45%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.24.98:58928]
> [21:37:13,796][INFO][grid-nio-worker-tcp-comm-3-#44%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.27.95:52532]
> [21:37:13,804][INFO][grid-nio-worker-tcp-comm-5-#46%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.17.174:54358]
> [21:37:13,805][INFO][grid-nio-worker-tcp-comm-6-#47%statement_grid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.31.20.209:48100, rmtAddr=/172.31.23.109:42528]
> [21:37:17,569][INFO][sys-#72%statement_grid%][GridDhtPartitionsExchangeFuture] Finish
exchange future [startVer=AffinityTopologyVersion [topVer=57, minorTopVer=0], resVer=AffinityTopologyVersion
[topVer=57, minorTopVer=0], err=null]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message