ignite-issues mailing list archives

From "Denis Magda (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (IGNITE-1239) Cache partition iterator throws exception when concurrent rebalancing is running
Date Wed, 12 Aug 2015 14:01:46 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693515#comment-14693515 ]

Denis Magda edited comment on IGNITE-1239 at 8/12/15 2:01 PM:
--------------------------------------------------------------

Created a test to reproduce the bug.
The test initially starts 3 nodes. Once all 3 nodes are ready, it executes per-partition
queries from one thread while several other threads start and stop additional nodes.

Could reproduce the "Partition can't be reserved" error only intermittently.

But the test exposed one more bug: it was constantly hanging on partition exchange. Spent
almost the whole day debugging the issue.
The hang was caused by a scan iterator that wasn't closed for local partitions
of the node that initially executed the query. This affected rebalancing when a node joined
or left the topology.
Fixed this issue in {{GridCacheQueryManager}}; the test doesn't hang any more.
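
To illustrate why an unclosed scan iterator can stall rebalancing, here is a minimal, self-contained sketch. It is not the code from the patch and does not use Ignite classes: {{Partition}}, {{ScanIterator}} and {{scan}} are hypothetical names, and the reservation counter is only an assumed model of how an open iterator keeps a partition pinned.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class ScanIteratorCloseSketch {
    /** Hypothetical stand-in for a cache partition; not an Ignite class. */
    static final class Partition {
        private final AtomicInteger reservations = new AtomicInteger();

        void reserve() { reservations.incrementAndGet(); }

        void release() { reservations.decrementAndGet(); }

        /** In this model, eviction (and so rebalancing) must wait for all reservations to go away. */
        boolean canEvict() { return reservations.get() == 0; }
    }

    /** Hypothetical scan iterator that keeps the partition reserved until it is closed. */
    static final class ScanIterator implements Iterator<String>, AutoCloseable {
        private final Partition part;
        private final Iterator<String> delegate;
        private boolean closed;

        ScanIterator(Partition part, List<String> entries) {
            this.part = part;
            this.delegate = entries.iterator();
            part.reserve();
        }

        @Override public boolean hasNext() { return !closed && delegate.hasNext(); }

        @Override public String next() { return delegate.next(); }

        /** Idempotent: releases the reservation exactly once. */
        @Override public void close() {
            if (!closed) {
                closed = true;
                part.release();
            }
        }
    }

    /** Scans the partition and always closes the iterator, even if iteration throws. */
    static int scan(Partition part, List<String> entries) {
        int cnt = 0;
        try (ScanIterator it = new ScanIterator(part, entries)) {
            while (it.hasNext()) {
                it.next();
                cnt++;
            }
        }
        return cnt;
    }

    public static void main(String[] args) {
        Partition part = new Partition();
        int cnt = scan(part, Arrays.asList("a", "b", "c"));
        System.out.println("scanned=" + cnt + ", canEvict=" + part.canEvict());
    }
}
```

The point of the try-with-resources block is that the reservation is released on every exit path, which is the property the local scan iterators were missing.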

The "Partition can't be reserved" error is no longer reproduced either. However, will try to
improve the test to double-check that this issue has been fixed as well.

The fix for the issue with local scan iterators can be reviewed in the attached patch.


> Cache partition iterator throws exception when concurrent rebalancing is running
> --------------------------------------------------------------------------------
>
>                 Key: IGNITE-1239
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1239
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>            Reporter: Alexey Goncharuk
>            Assignee: Denis Magda
>
> I observed this exception when IgniteRDD was iterating over a partition and two new nodes joined:
> {code}
> Caused by: class org.apache.ignite.IgniteCheckedException: Query execution failed: GridCacheQueryBean
[qry=GridCacheQueryAdapter [type=SCAN, clsName=null, clause=null, filter=org.apache.ignite.internal.processors.cache.IgniteCacheProxy$1@6490c94c,
part=138, incMeta=false, metrics=GridCacheQueryMetricsAdapter [minTime=10, maxTime=10, avgTime=10.0,
execs=1, fails=1, executed=true], pageSize=1024, timeout=0, keepAll=true, incBackups=false,
dedup=false, prj=null, keepPortable=false, subjId=9cdc9751-c6ec-43eb-968a-e941f2a1a8cd, taskHash=0],
rdc=null, trans=null]
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryFutureAdapter.checkError(GridCacheQueryFutureAdapter.java:245)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryFutureAdapter.internalIterator(GridCacheQueryFutureAdapter.java:303)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryFutureAdapter.next(GridCacheQueryFutureAdapter.java:156)
> 	... 17 more
> Caused by: class org.apache.ignite.IgniteCheckedException: Failed to execute query on
node [query=GridCacheQueryBean [qry=GridCacheQueryAdapter [type=SCAN, clsName=null, clause=null,
filter=org.apache.ignite.internal.processors.cache.IgniteCacheProxy$1@6490c94c, part=138,
incMeta=false, metrics=GridCacheQueryMetricsAdapter [minTime=0, maxTime=0, avgTime=0.0, execs=0,
fails=0, executed=false], pageSize=1024, timeout=0, keepAll=true, incBackups=false, dedup=false,
prj=null, keepPortable=false, subjId=9cdc9751-c6ec-43eb-968a-e941f2a1a8cd, taskHash=0], rdc=null,
trans=null], nodeId=963d0e35-7805-4b6d-8d64-22cce84e35f2]
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryFutureAdapter.onPage(GridCacheQueryFutureAdapter.java:370)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager.processQueryResponse(GridCacheDistributedQueryManager.java:377)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager.access$000(GridCacheDistributedQueryManager.java:44)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager$1.apply(GridCacheDistributedQueryManager.java:74)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager$1.apply(GridCacheDistributedQueryManager.java:72)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:534)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:240)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:48)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1026)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2256)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:946)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.access$1700(GridIoManager.java:60)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager$6.run(GridIoManager.java:915)
> 	... 3 more
> Caused by: class org.apache.ignite.IgniteCheckedException: Partition can't be reserved
> 	at org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:6808)
> {code}
> The issue is that the query request was sent to a backup node and, by the time the request
> arrived, the partition had already been evicted, which resulted in the "Partition can't be
> reserved" exception. We should automatically retry if this exception is encountered.
> I believe we have logic that retries, but it looks like there is a bug in that logic.
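
The automatic retry suggested in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: {{PartitionNotReservedException}} and {{withRetry}} are hypothetical names rather than Ignite APIs, and the existing retry logic in Ignite may be structured quite differently.

```java
import java.util.concurrent.Callable;

final class PartitionRetrySketch {
    /** Hypothetical exception standing in for the "Partition can't be reserved" failure. */
    static final class PartitionNotReservedException extends Exception {
        PartitionNotReservedException(String msg) { super(msg); }
    }

    /**
     * Runs the query, retrying only when the partition could not be reserved.
     * In a real system each attempt would presumably re-resolve the node that
     * currently owns the partition; here that is left to the supplied callable.
     */
    static <T> T withRetry(Callable<T> qry, int maxAttempts) throws Exception {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be >= 1");

        PartitionNotReservedException last = null;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return qry.call();
            }
            catch (PartitionNotReservedException e) {
                last = e; // Remember the failure and try again.
            }
        }

        throw last; // All attempts failed with the reservation error.
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};

        // Simulated query: the first two attempts hit an evicted partition.
        String res = withRetry(() -> {
            if (++calls[0] < 3)
                throw new PartitionNotReservedException("Partition can't be reserved");
            return "page-of-results";
        }, 5);

        System.out.println(res + " after " + calls[0] + " attempts");
    }
}
```

Any other exception is deliberately not retried, so unrelated query failures still surface immediately.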



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)