ignite-issues mailing list archives

From "Matija Polajnar (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (IGNITE-10226) Partition may restore wrong MOVING state during crash recovery
Date Wed, 16 Oct 2019 11:26:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952729#comment-16952729 ]

Matija Polajnar edited comment on IGNITE-10226 at 10/16/19 11:25 AM:
---------------------------------------------------------------------

In development environments (luckily only there, for now) we sometimes get errors like this one:
{code:java}
    ...
Caused by: javax.cache.CacheException: class org.apache.ignite.cluster.ClusterTopologyException: Cannot run update query. Node must own all the necessary partitions.
    at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1337) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.IgniteCacheFutureImpl.convertException(IgniteCacheFutureImpl.java:62) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl.get(IgniteFutureImpl.java:137) ~[ignite-core-2.7.0.jar:2.7.0]
    at com.marand.thinkehr.tasks.common.ignite.IgniteCompletableFuture.lambda$new$2ae3f52e$1(IgniteCompletableFuture.java:25) ~[classes/:?]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl$InternalFutureListener.apply(IgniteFutureImpl.java:215) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl$InternalFutureListener.apply(IgniteFutureImpl.java:179) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:385) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:355) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl.listen(IgniteFutureImpl.java:71) ~[ignite-core-2.7.0.jar:2.7.0]
    ...
Caused by: org.apache.ignite.cluster.ClusterTopologyException: Cannot run update query. Node must own all the necessary partitions.
    at org.apache.ignite.internal.util.IgniteUtils$7.apply(IgniteUtils.java:888) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.IgniteUtils$7.apply(IgniteUtils.java:886) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1337) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.IgniteCacheFutureImpl.convertException(IgniteCacheFutureImpl.java:62) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.util.future.IgniteFutureImpl.get(IgniteFutureImpl.java:137) ~[ignite-core-2.7.0.jar:2.7.0]
    ...
Caused by: org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Cannot run update query. Node must own all the necessary partitions.
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxAbstractEnlistFuture.checkPartitions(GridDhtTxAbstractEnlistFuture.java:922) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxAbstractEnlistFuture.init(GridDhtTxAbstractEnlistFuture.java:336) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxEnlistFuture.enlistLocal(GridNearTxEnlistFuture.java:518) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxEnlistFuture.sendBatch(GridNearTxEnlistFuture.java:413) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxEnlistFuture.sendNextBatches(GridNearTxEnlistFuture.java:168) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxEnlistFuture.map(GridNearTxEnlistFuture.java:144) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxAbstractEnlistFuture.init(GridNearTxAbstractEnlistFuture.java:241) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.updateAsync(GridNearTxLocal.java:2099) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.mvccRemoveAllAsync0(GridNearTxLocal.java:1976) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync0(GridNearTxLocal.java:1689) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.removeAllAsync(GridNearTxLocal.java:554) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter$40.op(GridCacheAdapter.java:3174) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter$AsyncOp.op(GridCacheAdapter.java:5288) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.asyncOp(GridCacheAdapter.java:4450) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.asyncOp(GridCacheAdapter.java:4345) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAllAsync0(GridCacheAdapter.java:3172) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.removeAllAsync(GridCacheAdapter.java:3159) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.removeAllAsync(IgniteCacheProxyImpl.java:1342) ~[ignite-core-2.7.0.jar:2.7.0]
    at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.removeAllAsync(GatewayProtectedCacheProxy.java:1072) ~[ignite-core-2.7.0.jar:2.7.0]
    ...
{code}
Since we run Ignite embedded in a Java application, it probably gets shut down uncleanly
quite often during development. This is typically a single-node setup. The backup count is set
to 1, but there is only one node anyway (so I'm not sure why a partition would ever be MOVING).

I set a breakpoint in GridDhtTxAbstractEnlistFuture.checkPartitions and found the offending
partitions had a status of MOVING.

I suspect this might also be why IgniteCache.get(x) and IgniteCache.containsKey(x) sometimes
return null and false, respectively, even though the cache certainly contains the key x
with a non-null value (e.g. cache.containsKey(cache.iterator().next().getKey()) returns false).

resetLostPartitions probably has no effect in this case?
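
The iterate-then-look-up check mentioned above could be generalized into a small diagnostic helper. This is only a sketch: the class name InvisibleKeyCheck and the method findInvisibleKeys are hypothetical, and the helper is written against plain java.util types so it runs without Ignite on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical diagnostic helper; not part of Ignite. */
class InvisibleKeyCheck {
    /** Returns every key that iteration produced but for which the lookup reports "absent". */
    static <K> List<K> findInvisibleKeys(Iterable<K> iteratedKeys, Predicate<K> containsKey) {
        List<K> invisible = new ArrayList<>();
        for (K key : iteratedKeys) {
            if (!containsKey.test(key))
                invisible.add(key);
        }
        return invisible;
    }

    public static void main(String[] args) {
        // Stand-in data: iteration yields three keys, but lookups succeed for only two.
        List<String> iterated = Arrays.asList("a", "b", "c");
        Set<String> visible = new HashSet<>(Arrays.asList("a", "c"));
        System.out.println(findInvisibleKeys(iterated, visible::contains));
    }
}
```

Against a real cache, the keys would presumably be collected from cache.iterator() via getKey() and the predicate would be cache::containsKey; a non-empty result would then point at keys affected by the behaviour described above.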


> Partition may restore wrong MOVING state during crash recovery
> --------------------------------------------------------------
>
>                 Key: IGNITE-10226
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10226
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.4
>            Reporter: Pavel Kovalenko
>            Assignee: Pavel Kovalenko
>            Priority: Major
>             Fix For: 2.8
>
>
> The issue can be reproduced only in versions that don't have IGNITE-9420:
> 1) Start a cache, upload some data to partitions, and call forceCheckpoint.
> 2) Start uploading additional data, then kill the node. The node should be killed so that the last checkpoint is skipped, or during the checkpoint mark phase.
> 3) Restart the node. The crash recovery process for partitions starts. When we create a partition during crash recovery (topology().forceCreatePartition()), we log its initial state to the WAL. If there is any logical update related to the partition, we log a wrong MOVING state to the end of the current WAL. This state is considered the last valid one when we process PartitionMetaStateRecord records during logical recovery. In the "restorePartitionsState" phase this state is chosen as final and the partition changes to MOVING, even if in page memory it is OWNING or something else.
> To fix this problem in versions 2.4 - 2.7, the additional logging of partition state changes to the WAL during crash recovery (logical recovery) should be removed.
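
The "last record wins" effect described in step 3 can be illustrated with a toy model. This is a minimal sketch and not Ignite code: the class PartitionStateReplaySketch, the StateRecord type, and the recover and restoreStates methods are all hypothetical stand-ins, and the real WAL and recovery logic are far more involved. It only assumes that state records are replayed in order and that the latest one per partition is taken as final.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of partition-state recovery; all names here are hypothetical, not Ignite APIs. */
class PartitionStateReplaySketch {
    enum State { MOVING, OWNING }

    /** Stand-in for a partition-state record written to the WAL. */
    static final class StateRecord {
        final int partId;
        final State state;

        StateRecord(int partId, State state) {
            this.partId = partId;
            this.state = state;
        }
    }

    /** Replays records in order; the last record seen for each partition wins. */
    static Map<Integer, State> restoreStates(List<StateRecord> wal) {
        Map<Integer, State> restored = new HashMap<>();
        for (StateRecord rec : wal)
            restored.put(rec.partId, rec.state);
        return restored;
    }

    /**
     * Models crash recovery in which the partition is re-created.
     * If logInitialState is true, the freshly created partition's initial MOVING
     * state is appended to the WAL tail before states are restored.
     */
    static Map<Integer, State> recover(List<StateRecord> walBeforeCrash, int partId, boolean logInitialState) {
        List<StateRecord> wal = new ArrayList<>(walBeforeCrash);
        if (logInitialState)
            wal.add(new StateRecord(partId, State.MOVING));
        return restoreStates(wal);
    }

    public static void main(String[] args) {
        List<StateRecord> wal = new ArrayList<>();
        wal.add(new StateRecord(0, State.OWNING)); // state recorded before the crash

        System.out.println("with extra logging:    " + recover(wal, 0, true));
        System.out.println("without extra logging: " + recover(wal, 0, false));
    }
}
```

In this model the extra record appended during recovery overrides the OWNING state recorded earlier, and dropping that extra logging (the proposed fix) leaves the partition OWNING.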



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
