From: Stanislav Lukyanov
To: "dev@ignite.apache.org"
Subject: RE: Ignite index corruption issue -> unrecoverable cluster
Date: Thu, 7 Feb 2019 17:09:47 +0300

Denis,

When an index is corrupted, you just need to remove the index.bin file of the affected cache.
After that, when the node starts, it will rebuild the indexes.
The performance of SQL queries will be low until the index is rebuilt, so you need to be cautious.

The main problem is recognizing that the indexes are corrupted in the first place.
Usually one needs to analyze the exception stack trace to find this out,
and that requires some familiarity with the Ignite code base.

The TODO lists I can come up with are:

# Recovering from an index corruption
## Applicable if
It is known that an index of a cache is corrupted, but the main data (partition files and WAL) is fine.
## Steps to recover
1. Stop the node
2. Delete index.bin of the affected caches (path is db//cache-/index.bin)
3. Start the node
   - Note: At this point the node is active in the cluster but does not have indexes.
     It still serves SQL queries, but their performance can be low.
     Avoid running SQL queries on large tables at this point
4. Wait for the message “Finished indexes rebuilding for cache ” in the Ignite log

# Recovering from a persistent storage corruption
## Applicable if
A part of the persistent storage (partition files, checkpoint markers or WAL) was corrupted
and there is no other way to recover it, but there are healthy copies of all data on other nodes.
## Steps to recover
1. Stop the node
2. Delete all persistence files of the node (best to clear the Ignite working directory, storage directory, WAL and WAL archive directories)
3. Make sure consistentId is explicitly set in the configuration of the node
   - If it isn’t, look up the generated consistentId using control.sh and set it explicitly in the config or via IGNITE_CONSISTENT_ID (2.8+ only)
4. Start the node
5. Wait for messages for all caches

We could have more fine-grained ways to handle data corruption once we address the issues
from the “Starting with missing PDS pieces” thread, create a recovery tool for WAL and/or partition files,
allow WAL records for a missing cache (say, when we deleted the corrupted files of a single cache), etc.

Stan

From: Denis Magda
Sent: 7 February 2019, 3:12
To: dev; Stanislav Lukyanov
Subject: Re: Ignite index corruption issue -> unrecoverable cluster

Stan,

Thanks for starting the "Starting with missing PDS pieces" thread, which promises
to embed usability changes into the source code.

In the meantime, could you propose a TODO list for recovering from index
corruption and similar scenarios? I know that you're experienced in that
and it will be great to document the procedures until the code is modified.
-
Denis

On Wed, Jan 30, 2019 at 1:02 PM Denis Magda wrote:

> Dmitry,
>
> Thanks, the FAQ section might make sense but, as practice shows, it's
> hard to get recommendations even for questions like this one :)
>
> Ignite experts, please chime in, the project fails with data corruption
> periodically and we have to explain how to work around it until an issue
> is resolved.
>
> -
> Denis
>
>
> On Wed, Jan 30, 2019 at 11:55 AM Dmitriy Pavlov wrote:
>
>> Denis,
>>
>> BTW one case of corruption is fixed here,
>> https://issues.apache.org/jira/browse/IGNITE-11030
>>
>> I still need a review from Ignite Native Persistence Experts. I feel it is
>> really important to apply such fixes.
>>
>> Sincerely,
>> Dmitriy Pavlov
>>
>> On Thu, Jan 24, 2019 at 16:29, Dmitriy Pavlov wrote:
>>
>> > Denis, what do you think about a more general idea of creating FAQs for
>> > Ignite users?
>> >
>> > What if experts place their answers on a wiki page once and then
>> > develop answers for frequent problems?
>> >
>> > And before diving into researching each problem, experienced community
>> > members will ask users to check the FAQ first?
>> >
>> > Sincerely,
>> > Dmitriy Pavlov
>> >
>> > P.S. Here is an article that the Apache guides refer to:
>> > http://www.catb.org/~esr/faqs/smart-questions.html - one of the actions
>> > required from users is to search for information.
>> >
>> > On Thu, Jan 24, 2019 at 01:55, Denis Magda wrote:
>> >
>> >> Another data/index corruption issue:
>> >>
>> >> https://stackoverflow.com/questions/54295401/ignite-transaction-failure-not-recoverable-with-persistance
>> >>
>> >> It's suggested to clean index.bin to be able to recover the cluster.
>> >> Folks,
>> >> let's prepare a list of actions to do if a cluster becomes unrecoverable
>> >> due to a data or index corruption issue. What should we do depending on
>> >> an exception:
>> >>
>> >> - Remove index.bin if X or Y or Z
>> >> - etc
>> >>
>> >>
>> >> --
>> >> Denis Magda
>> >>
>> >>
>> >> On Sun, Dec 30, 2018 at 10:06 AM Denis Magda wrote:
>> >>
>> >> > Ignite SQL and memory experts,
>> >> >
>> >> > The following issue was reported on SO:
>> >> >
>> >> > https://stackoverflow.com/questions/53979106/ignite-corruptedtreeexception-leads-to-cluster-failure
>> >> >
>> >> > The stack trace starts with the message below, more details are in
>> >> > that forum:
>> >> >
>> >> > [SEVERE][data-streamer-stripe-2-#15][GridDhtAtomicCache]
>> >> > Unexpected exception during cache update
>> >> > org.h2.message.DbException: General error: "class
>> >> > org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>> >> > Runtime failure on row: Row@75ab6623[ key: CacheKey [idHash=242632156,
>> >> > hash=-841684964, parentId=-8607237606486310912, hour=9,
>> >> > id=-8607237528489033728, date=2018-09-09 00:00:00.0], val: CacheValue
>> >> > [idHash=843227122, hash=-801894604, ....
>> >> >
>> >> > Let's see if it's addressed in the latest release. Also, the user
>> >> > asked a reasonable question - how to recover? Yes, it's possible to use
>> >> > snapshots of GridGain if they were created beforehand, but I remember
>> >> > some discussions around a recovery tool.
>> >> >
>> >> > --
>> >> > Denis
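
For anyone scripting the index recovery steps from Stan's reply at the top of this thread, here is a rough Python sketch of steps 2 and 4 only. It is an illustration, not an official Ignite tool: the function names are made up for this example, the cache directories must be passed in by the operator (the sketch does not guess the storage layout), and the only log text it relies on is the "Finished indexes rebuilding for cache" prefix quoted in that reply. Stopping and starting the node (steps 1 and 3) is left to the operator.

```python
from pathlib import Path

# Log prefix quoted in the reply above; the cache name follows it in the log.
REBUILD_MARKER = "Finished indexes rebuilding for cache"


def remove_index_files(cache_dirs, dry_run=True):
    """Delete index.bin from each given cache directory (step 2).

    cache_dirs is supplied by the caller (hypothetical parameter); run this
    only while the node is stopped. With dry_run=True nothing is deleted.
    Returns the index.bin paths that were found.
    """
    found = []
    for cache_dir in cache_dirs:
        index_file = Path(cache_dir) / "index.bin"
        if index_file.is_file():
            found.append(index_file)
            if not dry_run:
                index_file.unlink()
    return found


def rebuild_messages(log_path):
    """Return the log lines that report a finished index rebuild (step 4)."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log if REBUILD_MARKER in line]
```

The dry-run default is deliberate: the first call only lists what would be removed, so partition files and WAL are never touched and the operator can check the paths before passing dry_run=False.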