ignite-user mailing list archives

From Arseny Kovalchuk <arseny.kovalc...@synesis.ru>
Subject Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))
Date Thu, 15 Mar 2018 16:25:30 GMT
Hi guys,

I've got a reproducer for the problem that is generally reported as "Caused
by: java.lang.IllegalStateException: Failed to get page IO instance (page
content is corrupted)". Strictly speaking, it reproduces the result rather
than the cause: I have no idea how the data got corrupted, but a cluster
node refuses to start with this data.

We hit the issue again when some of the server nodes were restarted several
times by Kubernetes. I suspect that the data got corrupted during those
restarts. But the main thing we really want is that the cluster DOESN'T HANG
on the next restart even if the data is corrupted! As it is, there is no
tool that can help repair such data, so we end up wiping all the data
manually to start the cluster. So the expected behavior is warnings about
corrupted data in the logs and a cluster that keeps working.

How to reproduce:
1. Download the data from
https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200 MB)
2. Download and import the Gradle project from
https://storage.googleapis.com/pub-data-0/project.tar.gz (~100 KB)
3. Unpack the data into your home folder, say /home/user1. You should get a
path like */home/user1/data5*. Inside data5 you should have binary_meta,
db and marshaller.
4. Open *src/main/resources/data-test.xml* and put the absolute path of the
unpacked data into the *workDirectory* property of the *igniteCfg5* bean. In
this example it should be */home/user1/data5*. Do not edit consistentId!
The consistentId is ignite-instance-5, so the real data is in
the data5/db/ignite_instance_5 folder.
5. Start the application from ru.synesis.kipod.DataTestBootApp
6. Enjoy
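
Before starting the application, the layout from steps 3 and 4 can be sanity-checked. Below is a minimal sketch using only the JDK; the class and method names are made up for illustration, and the '-' to '_' mapping is an assumption inferred from the ignite-instance-5 / ignite_instance_5 example above, not a statement of how Ignite derives folder names in general.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DataLayoutCheck {
    // Assumption: the storage folder is the consistentId with '-' replaced by '_',
    // as in the ignite-instance-5 -> ignite_instance_5 example from step 4.
    static String storageFolderName(String consistentId) {
        return consistentId.replace('-', '_');
    }

    // Returns the expected entries that are missing under the work directory.
    static List<String> missingEntries(Path workDir, String consistentId) {
        List<String> missing = new ArrayList<>();
        // Top-level folders listed in step 3.
        for (String name : new String[] {"binary_meta", "db", "marshaller"}) {
            if (!Files.isDirectory(workDir.resolve(name))) {
                missing.add(name);
            }
        }
        // Per-node data folder mentioned in step 4.
        String nodeFolder = storageFolderName(consistentId);
        if (!Files.isDirectory(workDir.resolve("db").resolve(nodeFolder))) {
            missing.add("db/" + nodeFolder);
        }
        return missing;
    }

    public static void main(String[] args) {
        // The default path is the example location used in the steps above.
        Path workDir = Paths.get(args.length > 0 ? args[0] : "/home/user1/data5");
        List<String> missing = missingEntries(workDir, "ignite-instance-5");
        if (missing.isEmpty()) {
            System.out.println("Layout looks complete: " + workDir);
        } else {
            System.out.println("Missing under " + workDir + ": " + missing);
        }
    }
}
```

Pass the unpacked path as the first argument. An empty "missing" list only means the folders exist; it says nothing about whether the pages inside them are intact.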

Hope this helps.

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 26 December 2017 at 21:15, Denis Magda <dmagda@apache.org> wrote:

> Cross-posting to the dev list.
>
> Ignite persistence maintainers please chime in.
>
> —
> Denis
>
> On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <arseny.kovalchuk@synesis.ru>
> wrote:
>
> Hi guys.
>
> Here is another issue we hit when using Ignite 2.3 with native persistence
> enabled. See details below.
>
> We deploy Ignite along with our services in Kubernetes (v 1.8) on
> premises. The Ignite cluster is a StatefulSet of 5 Pods (5 instances) of
> Ignite version 2.3. Each Pod mounts a PersistentVolume backed by Ceph RBD.
>
> We put about 230 events/second into Ignite; 70% of the events are ~200KB in
> size and 30% are 5000KB. The smaller events have indexed fields and we
> query them via SQL.
>
> The cluster is activated from a client node, which also streams events into
> Ignite from Kafka. We use a custom streamer implementation that uses the
> cache.putAll() API.
>
> We started the cluster from scratch without any persistent data. After a
> while we got corrupted data with the following error message:
>
> [2017-12-26 07:44:14,251] ERROR [sys-#127%ignite-instance-2%]
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader:
> - Partition eviction failed, this can cause grid hang.
> class org.apache.ignite.IgniteException: Runtime failure on search row:
> Row@5b1479d6[ key: 171:1513946618964:3008806055072854, val:
> ru.synesis.kipod.event.KipodEvent [idHash=510912646, hash=-387621419,
> face_last_name=null, face_list_id=null, channel=171, source=,
> face_similarity=null, license_plate_number=null, descriptors=null,
> cacheName=kipod_events, cacheKey=171:1513946618964:3008806055072854,
> stream=171, alarm=false, processed_at=0, face_id=null, id=3008806055072854,
> persistent=false, face_first_name=null, license_plate_first_name=null,
> face_full_name=null, level=0, module=Kpx.Synesis.Outdoor,
> end_time=1513946624379, params=null, commented_at=0, tags=[vehicle, 0,
> human, 0, truck, 0, start_time=1513946618964, processed=false,
> kafka_offset=111259, license_plate_last_name=null, armed=false,
> license_plate_country=null, topic=MovingObject, comment=,
> expiration=1514033024000, original_id=null, license_plate_lists=null], ver:
> GridCacheVersion [topVer=125430590, order=1513955001926, nodeOrder=3] ][
> 3008806055072854, MovingObject, Kpx.Synesis.Outdoor, 0, , 1513946618964,
> 1513946624379, 171, 171, FALSE, FALSE, , FALSE, FALSE, 0, 0, 111259,
> 1514033024000, (vehicle, 0, human, 0, truck, 0), null, null, null, null,
> null, null, null, null, null, null, null, null ]
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1787)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1578)
> at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.remove(H2TreeIndex.java:216)
> at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.doUpdate(GridH2Table.java:496)
> at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:423)
> at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:580)
> at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2334)
> at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:461)
> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:1453)
> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.remove(IgniteCacheOffheapManagerImpl.java:1416)
> at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.remove(GridCacheOffheapManager.java:1271)
> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.remove(IgniteCacheOffheapManagerImpl.java:374)
> at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.removeValue(GridCacheMapEntry.java:3233)
> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheEntry.clearInternal(GridDhtCacheEntry.java:588)
> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:951)
> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.tryEvict(GridDhtLocalPartition.java:809)
> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:593)
> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:580)
> at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6631)
> at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
> at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
> at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
> at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:148)
> at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:102)
> at org.apache.ignite.internal.processors.query.h2.database.H2RowFactory.getRow(H2RowFactory.java:62)
> at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:126)
> at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:36)
> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:123)
> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:40)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.getRow(BPlusTree.java:4372)
> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:200)
> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:40)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:4359)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:4279)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1500(BPlusTree.java:81)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:261)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4697)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4682)
> at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:158)
> at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:319)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1823)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1752)
> ... 23 more
>
>
> After a restart we get this error again. See *ignite-instance-2.log*.
>
> The *cache-config.xml* is used for *server* instances.
> The *ignite-common-cache-conf.xml* is used for *client* instances, which
> activate the cluster and stream data from Kafka into Ignite.
>
> *Is it possible to tune (or implement) native persistence so that it just
> reports an error or corrupted data, skips it, and continues working
> without the corrupted part, so that the cluster keeps operating
> regardless of storage errors?*
>
>
> Arseny Kovalchuk
>
> Senior Software Engineer at Synesis
> skype: arseny.kovalchuk
> mobile: +375 (29) 666-16-16
> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
> <ignite-instance-0.log><ignite-instance-1.log><ignite-instance-2.log>
> <ignite-instance-3.log><ignite-instance-4.log><cache-config.xml>
> <ignite-discovery-kubernetes.xml><ignite-common.xml>
> <ignite-common-storage.xml><ignite-common-entity.xml>
>
>
>
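
For a sense of the write load described in the quoted message (about 230 events/second, 70% of them at ~200KB and 30% at 5000KB, on 5 server nodes), a rough estimate can be sketched as below. The class and method names are hypothetical, the figures are taken exactly as stated, and backups, indexes and WAL overhead are ignored, so treat the result as a lower bound on what the storage has to absorb.

```java
public class IngestEstimate {
    // Average event size in KB for a two-bucket size mix given in whole percents.
    static long averageEventKb(int smallPct, long smallKb, int largePct, long largeKb) {
        return (smallPct * smallKb + largePct * largeKb) / 100;
    }

    // Raw payload volume per second, in KB.
    static long ingestKbPerSecond(long eventsPerSecond, long avgEventKb) {
        return eventsPerSecond * avgEventKb;
    }

    public static void main(String[] args) {
        // Figures from the quoted message; even distribution over nodes is an assumption.
        long avgKb = averageEventKb(70, 200, 30, 5000);
        long totalKb = ingestKbPerSecond(230, avgKb);
        long perNodeKb = totalKb / 5;

        System.out.println("Average event size: " + avgKb + " KB");
        System.out.println("Cluster-wide ingest: " + totalKb + " KB/s");
        System.out.println("Per node (5 nodes, no backups): " + perNodeKb + " KB/s");
    }
}
```

With the stated numbers this gives an average of 1640 KB per event, 377200 KB/s for the whole cluster and 75440 KB/s per node, so the larger events account for almost all of the volume.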
