ignite-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Re: Ignite instance fails. Unsure as to root cause
Date Thu, 15 Oct 2020 09:57:00 GMT
Hello!

Oct 12 17:49:41 nalrcsvridbq02 kernel: watchdog: BUG: soft lockup - CPU#0
stuck for 38s! [kworker/u256:0:3703404]

This is bad. Your system kernel says your CPU#0 was hanging for 38 seconds.

This is enough to trigger failure detection timeout and kill and instance,
or at least for the node to be segmented from cluster by the other nodes.

Regards,
-- 
Ilya Kasnacheev


ср, 14 окт. 2020 г. в 16:00, bbellrose <brian.bellrose@wabtec.com>:

> Oct 12 17:47:39 nalrcsvridbq02 Ignite[2031634]: [17:47:39] Possible failure
> suppressed accordingly to a configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
> o.a.i.IgniteException: GridWorker [name=nio-acceptor-tcp-comm,
> igniteInstanceName=RailConnect Ignite QA Grid, finished=false,
> heartbeatTs=1602539219763]]]
> Oct 12 17:48:20 nalrcsvridbq02 Ignite[2031634]: [17:48:20] Possible failure
> suppressed accordingly to a configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
> o.a.i.IgniteException: GridWorker [name=grid-nio-worker-tcp-comm-1,
> igniteInstanceName=RailConnect Ignite QA Grid, finished=false,
> heartbeatTs=1602539260020]]]
> Oct 12 17:48:20 nalrcsvridbq02 Ignite[2031634]: [17:48:20] Possible failure
> suppressed accordingly to a configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
> o.a.i.IgniteException: GridWorker [name=ttl-cleanup-worker,
> igniteInstanceName=RailConnect Ignite QA Grid, finished=false,
> heartbeatTs=1602539260020]]]
> Oct 12 17:48:20 nalrcsvridbq02 Ignite[2031634]: [17:48:20] Possible failure
> suppressed accordingly to a configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
> o.a.i.IgniteException: GridWorker [name=db-checkpoint-thread,
> igniteInstanceName=RailConnect Ignite QA Grid, finished=false,
> heartbeatTs=1602539260006]]]
> Oct 12 17:49:00 nalrcsvridbq02 chronyd[1216]: Forward time jump detected!
> Oct 12 17:49:00 nalrcsvridbq02 chronyd[1216]: Can't synchronise: no
> selectable sources
> Oct 12 17:49:00 nalrcsvridbq02 process-agent[2039258]: 2020-10-12 17:49:00
> EDT | PROCESS | INFO | (collector.go:209 in func1) | Delivery queues:
> process[size=0, weight=0], pod[size=0, weight=0]
> Oct 12 17:49:00 nalrcsvridbq02 agent[2039257]: 2020-10-12 17:49:00 EDT |
> CORE | ERROR | (pkg/forwarder/worker.go:178 in process) | Error while
> processing transaction: error while sending transaction, rescheduling it:
> Post
>
> https://7-22-1-app.agent.datadoghq.com/api/v1/series?api_key=*************************44602
> :
> net/http: request canceled (Client.Timeout exceeded while awaiting headers)
> Oct 12 17:49:00 nalrcsvridbq02 trace-agent[2039259]: 2020-10-12 17:49:00
> EDT
> | TRACE | INFO | (pkg/trace/info/stats.go:101 in LogStats) | No data
> received
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: [17:49:00] Topology
> snapshot
> [ver=21, locNode=2eca41b3, servers=1, clients=0, state=ACTIVE, CPUs=2,
> offheap=2.0GB, heap=0.25GB]
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: [17:49:00]   ^-- Baseline
> [id=0, size=2, online=1, offline=1]
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: [17:49:00] (err) Failed to
> execute compound future reducer: GridNearTxFinishFuture
> [futId=7a28b630571-b3eac955-0171-4b45-b048-84653e88427e, tx=GridNearTxLocal
> [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping
> [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey
> [key=KeyCacheObject
> [hasValBytes=true], cacheId=-27866919], val=BinaryObject
> [idHash=1523169004,
> hash=1743117496][op=CREATE, val=], prevVal=[op=NOOP, val=null],
> oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
> conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
> filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=true,
> entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry
> [super=GridCacheMapEntry [key=KeyCacheObject [hasValBytes=true], val=null,
> ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=684422756,
> extras=null, flags=0]]], prepared=0, locked=false,
> nodeId=3f8b7981-ee81-4ca4-9f52-5c6f03cb8cee, locMapped=false,
> expiryPlc=null, transferExpiryPlc=false, flags=2, partUpdateCntr=0,
> serReadVer=null, xidVer=GridCacheVersion [topVer=212970798,
> order=1602505621810, nodeOrder=16]]], explicitLock=false,
> queryUpdate=false,
> dhtVer=null, last=false, nearEntries=0, clientFirst=false,
> node=3f8b7981-ee81-4ca4-9f52-5c6f03cb8cee]], nearLocallyMapped=false,
> colocatedLocallyMapped=false, needCheckBackup=false, hasRemoteLocks=false,
> trackTimeout=false, systemTime=0, systemStartTime=3310262430790233,
> prepareStartTime=3310262430790275, prepareTime=80685139741,
> commitOrRollbackStartTime=3310343115930116, commitOrRollbackTime=0,
>
> txDumpsThrottling=o.a.i.i.processors.cache.transactions.IgniteTxManager$TxDumpsThrottling@4b23d4ce
> ,
> lb=null, mvccOp=false, qryId=-1, crdVer=0,
> thread=client-connector-#129%RailConnect Ignite QA Grid%,
> mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping
> [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey
> [key=KeyCacheObject
> [hasValBytes=true], cacheId=-27866919], val=BinaryObject
> [idHash=1523169004,
> hash=1743117496][op=CREATE, val=], prevVal=[op=NOOP, val=null],
> oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
> conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
> filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=true,
> entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry
> [super=GridCacheMapEntry [key=KeyCacheObject [hasValBytes=true], val=null,
> ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=684422756,
> extras=null, flags=0]]], prepared=0, locked=false,
> nodeId=3f8b7981-ee81-4ca4-9f52-5c6f03cb8cee, locMapped=false,
> expiryPlc=null, transferExpiryPlc=false, flags=2, partUpdateCntr=0,
> serReadVer=null, xidVer=GridCacheVersion [topVer=212970798,
> order=1602505621810, nodeOrder=16]]], explicitLock=false,
> queryUpdate=false,
> dhtVer=null, last=false, nearEntries=0, clientFirst=false,
> node=3f8b7981-ee81-4ca4-9f52-5c6f03cb8cee]], super=GridDhtTxLocalAdapter
> [nearOnOriginatingNode=false, nearNodes=KeySetView [], dhtNodes=KeySetView
> [], explicitLock=false, super=IgniteTxLocalAdapter [completedBase=null,
> sndTransformedVals=false, depEnabled=false,
> txState=IgniteTxImplicitSingleStateImpl [init=true, recovery=false,
> useMvccCaching=false], super=IgniteTxAdapter [xidVer=GridCacheVersion
> [topVer=212970798, order=1602505621810, nodeOrder=16], writeVer=null,
> implicit=true, loc=true, threadId=195, startTime=1602539259845,
> nodeId=2eca41b3-7078-44ce-89d2-8bc6f67334a6, startVer=GridCacheVersion
> [topVer=212970798, order=1602505621810, nodeOrder=16], endVer=null,
> isolation=READ_COMMITTED, concurrency=OPTIMISTIC, timeout=0,
> sysInvalidate=false, sys=false, plc=2, commitVer=GridCacheVersion
> [topVer=212970798, order=1602505621810, nodeOrder=16], finalizing=NONE,
> invalidParts=null, state=PREPARED, timedOut=false,
> topVer=AffinityTopologyVersion [topVer=20, minorTopVer=1],
> mvccSnapshot=null, skipCompletedVers=false, parentTx=null,
> duration=80663ms,
> onePhaseCommit=true], size=1]]], commit=true,
> mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping
> [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey
> [key=KeyCacheObject
> [hasValBytes=true], cacheId=-27866919], val=BinaryObject
> [idHash=1523169004,
> hash=1743117496][op=CREATE, val=], prevVal=[op=NOOP, val=null],
> oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
> conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
> filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=true,
> entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry
> [super=GridCacheMapEntry [key=KeyCacheObject [hasValBytes=true], val=null,
> ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=684422756,
> extras=null, flags=0]]], prepared=0, locked=false,
> nodeId=3f8b7981-ee81-4ca4-9f52-5c6f03cb8cee, locMapped=false,
> expiryPlc=null, transferExpiryPlc=false, flags=2, partUpdateCntr=0,
> serReadVer=null, xidVer=GridCacheVersion [topVer=212970798,
> order=1602505621810, nodeOrder=16]]], explicitLock=false,
> queryUpdate=false,
> dhtVer=null, last=false, nearEntries=0, clientFirst=false,
> node=3f8b7981-ee81-4ca4-9f52-5c6f03cb8cee]], trackable=true,
> finishOnePhaseCalled=false, innerFuts=TransformCollectionView
> [CheckBackupFuture[node=2eca41b3-7078-44ce-89d2-8bc6f67334a6, loc=true,
> done=true]], super=GridCompoundIdentityFuture [super=GridCompoundFuture
> [rdc=AlwaysTrueReducer [], initFlag=0, lsnrCalls=0, done=false,
> cancelled=false, err=null, futs=TransformCollectionView [true]]]]class
> org.apache.ignite.internal.transactions.IgniteTxRollbackCheckedException:
> Failed to commit transaction (transaction has been rolled back on backup
> node): GridCacheVersion [topVer=212970798, order=1602505621810,
> nodeOrder=16]
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxFinishFuture.checkBackup(GridNearTxFinishFuture.java:618)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxFinishFuture.finish(GridNearTxFinishFuture.java:378)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal$25.apply(GridNearTxLocal.java:4129)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal$25.apply(GridNearTxLocal.java:4118)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:399)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:347)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:335)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:490)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFuture.onComplete(GridNearOptimisticTxPrepareFuture.java:281)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFuture.onError(GridNearOptimisticTxPrepareFuture.java:152)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFuture.access$300(GridNearOptimisticTxPrepareFuture.java:76)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFuture$MiniFuture.onNodeLeft(GridNearOptimisticTxPrepareFuture.java:938)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFuture.onNodeLeft(GridNearOptimisticTxPrepareFuture.java:127)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.GridCacheMvccManager$4.onEvent(GridCacheMvccManager.java:266)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager$LocalListenerWrapper.onEvent(GridEventStorageManager.java:1391)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:886)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:871)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:346)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:309)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2762)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:2990)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2794)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
> java.lang.Thread.run(Thread.java:748)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: Caused by: class
> org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Primary
> node left grid: 3f8b7981-ee81-4ca4-9f52-5c6f03cb8cee
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011at
>
> org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxFinishFuture.checkBackup(GridNearTxFinishFuture.java:612)
> Oct 12 17:49:00 nalrcsvridbq02 Ignite[2031634]: #011... 25 more
> Oct 12 17:49:00 nalrcsvridbq02 agent[2039257]: 2020-10-12 17:49:00 EDT |
> CORE | ERROR | (pkg/forwarder/worker.go:174 in process) | Too many errors
> for endpoint
> '
> https://7-22-1-app.agent.datadoghq.com/api/v1/series?api_key=*************************44602
> ':
> retrying later
> Oct 12 17:49:40 nalrcsvridbq02 chronyd[1216]: Forward time jump detected!
> Oct 12 17:49:41 nalrcsvridbq02 kernel: watchdog: BUG: soft lockup - CPU#0
> stuck for 38s! [kworker/u256:0:3703404]
> Oct 12 17:49:41 nalrcsvridbq02 kernel: Modules linked in: binfmt_misc
> nf_tables nfnetlink vmw_vsock_vmci_transport vsock ext4 mbcache jbd2
> intel_rapl_msr intel_rapl_common nfit libnvdimm crct10dif_pclmul
> crc32_pclmul ghash_clmulni_intel vmw_balloon intel_rapl_perf joydev pcspkr
> vmw_vmci i2c_piix4 auth_rpcgss sunrpc ip_tables xfs libcrc32c sr_mod cdrom
> vmwgfx ata_generic drm_kms_helper syscopyarea sysfillrect sysimgblt
> fb_sys_fops sd_mod sg ttm ahci drm libahci crc32c_intel ata_piix serio_raw
> vmxnet3 libata vmw_pvscsi dm_mirror dm_region_hash dm_log dm_mod
> Oct 12 17:49:41 nalrcsvridbq02 kernel: CPU: 0 PID: 3703404 Comm:
> kworker/u256:0 Kdump: loaded Tainted: G             L   --------- -  -
> 4.18.0-193.14.2.el8_2.x86_64 #1
> Oct 12 17:49:41 nalrcsvridbq02 kernel: Hardware name: VMware, Inc. VMware
> Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/03/2018
> Oct 12 17:49:41 nalrcsvridbq02 kernel: Workqueue: writeback wb_workfn
> (flush-253:3)
> Oct 12 17:49:41 nalrcsvridbq02 kernel: RIP:
> 0010:_raw_spin_unlock_irqrestore+0x11/0x20
> Oct 12 17:49:41 nalrcsvridbq02 kernel: Code: 85 ff 48 29 e8 4c 39 e0 76 cf
> 80 0b 08 eb 8c 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 0f 1f
> 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 c3 66 0f 1f 84 00 00 00 00 00 0f 1f
> 44
> 00 00 8b 07
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Mime
View raw message