From: Dmitry Pavlov
Date: Thu, 15 Mar 2018 16:34:35 +0000
Subject: Re: Partition eviction failed, this can cause grid hang.
(Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))
To: user@ignite.apache.org, Alexey Goncharuk
Cc: dev@ignite.apache.org

Hi Alexey,

This may be a serious issue. Could you recommend an expert who can pick it up?

Sincerely,
Dmitriy Pavlov

On Thu, 15 Mar 2018 at 19:25, Arseny Kovalchuk wrote:

> Hi, guys.
>
> I've got a reproducer for a problem which is generally reported as "Caused
> by: java.lang.IllegalStateException: Failed to get page IO instance (page
> content is corrupted)". Actually, it reproduces the result. I have no
> idea how the data got corrupted, but the cluster node doesn't want to
> start with this data.
>
> We got the issue again when some of the server nodes were restarted several
> times by Kubernetes. I suspect that the data got corrupted during those
> restarts. But the main functionality that we really want is that
> the cluster DOESN'T HANG during the next restart even if the data is corrupted!
> Anyway, there is no tool that can help to correct such data, and as a
> result we wipe all data manually to start the cluster. So, having warnings
> about corrupted data in the logs and a working cluster is the expected
> behavior.
>
> How to reproduce:
> 1. Download the data from here
> https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200 MB)
> 2. Download and import the Gradle project
> https://storage.googleapis.com/pub-data-0/project.tar.gz (~100 KB)
> 3. Unpack the data to the home folder, say /home/user1. You should get a
> path like */home/user1/data5*. Inside data5 you should have binary_meta,
> db, marshaller.
> 4. Open *src/main/resources/data-test.xml* and put the absolute path of
> the unpacked data into the *workDirectory* property of the *igniteCfg5* bean.
> In this example it should be */home/user1/data5.* Do not edit consistentId!
> The consistentId is ignite-instance-5, so the real data is in
> the data5/db/ignite_instance_5 folder
> 5. Start the application from ru.synesis.kipod.DataTestBootApp
> 6. Enjoy
>
> Hope it will help.
>
> Arseny Kovalchuk
>
> Senior Software Engineer at Synesis
> skype: arseny.kovalchuk
> mobile: +375 (29) 666-16-16
> LinkedIn Profile
>
> On 26 December 2017 at 21:15, Denis Magda wrote:
>
>> Cross-posting to the dev list.
>>
>> Ignite persistence maintainers please chime in.
>>
>> --
>> Denis
>>
>> On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk wrote:
>>
>> Hi guys.
>>
>> Another issue when using Ignite 2.3 with native persistence enabled. See
>> details below.
>>
>> We deploy Ignite along with our services in Kubernetes (v 1.8) on
>> premises. The Ignite cluster is a StatefulSet of 5 Pods (5 instances) of
>> Ignite version 2.3. Each Pod mounts a PersistentVolume backed by CEPH RBD.
>>
>> We put about 230 events/second into Ignite; 70% of events are ~200KB in
>> size and 30% are 5000KB. Smaller events have indexed fields and we query
>> them via SQL.
>>
>> The cluster is activated from a client node which also streams events
>> into Ignite from Kafka. We use a custom implementation of the streamer which
>> uses the cache.putAll() API.
>>
>> We started the cluster from scratch without any persistent data. After a
>> while we got corrupted data with the following error message.
>>
>> [2017-12-26 07:44:14,251] ERROR [sys-#127%ignite-instance-2%]
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader:
>> - Partition eviction failed, this can cause grid hang.
>> class org.apache.ignite.IgniteException: Runtime failure on search row:
>> Row@5b1479d6[ key: 171:1513946618964:3008806055072854, val:
>> ru.synesis.kipod.event.KipodEvent [idHash=510912646, hash=-387621419,
>> face_last_name=null, face_list_id=null, channel=171, source=,
>> face_similarity=null, license_plate_number=null, descriptors=null,
>> cacheName=kipod_events, cacheKey=171:1513946618964:3008806055072854,
>> stream=171, alarm=false, processed_at=0, face_id=null, id=3008806055072854,
>> persistent=false, face_first_name=null, license_plate_first_name=null,
>> face_full_name=null, level=0, module=Kpx.Synesis.Outdoor,
>> end_time=1513946624379, params=null, commented_at=0, tags=[vehicle, 0,
>> human, 0, truck, 0, start_time=1513946618964, processed=false,
>> kafka_offset=111259, license_plate_last_name=null, armed=false,
>> license_plate_country=null, topic=MovingObject, comment=,
>> expiration=1514033024000, original_id=null, license_plate_lists=null], ver:
>> GridCacheVersion [topVer=125430590, order=1513955001926, nodeOrder=3] ][
>> 3008806055072854, MovingObject, Kpx.Synesis.Outdoor, 0, , 1513946618964,
>> 1513946624379, 171, 171, FALSE, FALSE, , FALSE, FALSE, 0, 0, 111259,
>> 1514033024000, (vehicle, 0, human, 0, truck, 0), null, null, null, null,
>> null, null, null, null, null, null, null, null ]
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1787)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1578)
>>     at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.remove(H2TreeIndex.java:216)
>>     at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.doUpdate(GridH2Table.java:496)
>>     at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:423)
>>     at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:580)
>>     at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2334)
>>     at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:461)
>>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:1453)
>>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.remove(IgniteCacheOffheapManagerImpl.java:1416)
>>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.remove(GridCacheOffheapManager.java:1271)
>>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.remove(IgniteCacheOffheapManagerImpl.java:374)
>>     at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.removeValue(GridCacheMapEntry.java:3233)
>>     at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheEntry.clearInternal(GridDhtCacheEntry.java:588)
>>     at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:951)
>>     at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.tryEvict(GridDhtLocalPartition.java:809)
>>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:593)
>>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:580)
>>     at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6631)
>>     at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>     at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.IllegalStateException: Failed to get page IO
>> instance (page content is corrupted)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
>>     at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:148)
>>     at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:102)
>>     at org.apache.ignite.internal.processors.query.h2.database.H2RowFactory.getRow(H2RowFactory.java:62)
>>     at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:126)
>>     at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:36)
>>     at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:123)
>>     at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:40)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.getRow(BPlusTree.java:4372)
>>     at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:200)
>>     at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:40)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:4359)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:4279)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1500(BPlusTree.java:81)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:261)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4697)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4682)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:158)
>>     at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:319)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1823)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1752)
>>     ... 23 more
>>
>> After a restart we also get this error. See *ignite-instance-2.log*.
>>
>> The *cache-config.xml* is used for *server* instances.
>> The *ignite-common-cache-conf.xml* is used for *client* instances, which
>> activate the cluster and stream data from Kafka into Ignite.
>>
>> *Is it possible to tune (or implement) native persistence so that it
>> just reports errors or corrupted data, skips the corrupted part, and
>> continues to work without it? That would let the cluster keep operating
>> regardless of storage errors.*
>>
>> Arseny Kovalchuk
>>
>> Senior Software Engineer at Synesis
>> skype: arseny.kovalchuk
>> mobile: +375 (29) 666-16-16
>> LinkedIn Profile