ignite-dev mailing list archives

From Stanislav Lukyanov <stanlukya...@gmail.com>
Subject RE: Ignite index corruption issue -> unrecoverable cluster
Date Thu, 07 Feb 2019 14:09:47 GMT
Denis,

When an index is corrupted, you just need to remove the index.bin file of the affected cache.
After that, the node will rebuild the indexes when it starts.
SQL query performance will be low until the indexes are rebuilt, so you need to
be cautious.

The main problem is recognizing that the indexes are corrupted.
Usually one needs to analyze the exception stack trace to find this out,
and that requires some familiarity with the Ignite code base.
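
As a rough first pass before reading stack traces by hand, one could scan the node log for the exception class quoted further down in this thread (CorruptedTreeException). This is only a sketch: the helper name is made up for illustration, and a hit suggests that some tree is corrupted without proving that only the indexes are affected, so the stack trace still needs a human look.

```python
from typing import Iterable, List, Tuple

# Exception class name taken from the stack trace quoted in this thread.
# Assumption: a hit is a hint of tree corruption, not proof of index-only damage.
CORRUPTION_MARKER = "CorruptedTreeException"


def find_corruption_hits(
    log_lines: Iterable[str], marker: str = CORRUPTION_MARKER
) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for every log line that mentions the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if marker in line
    ]
```

Passing an open log file to find_corruption_hits gives the line numbers to start reading from.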

The TODO lists I can come up with are:

# Recovering from an index corruption
## Applicable if
It is known that an index of a cache is corrupted, but the main data (partition files and
WAL) is fine.

## Steps to recover
1. Stop the node
2. Delete index.bin of the affected caches (path is db/<consistent_id>/cache-<cache_name>/index.bin)
3. Start the node
- Note: At this point the node is active in the cluster but doesn't have indexes yet.
It still serves SQL queries, but their performance can be low.
Avoid running SQL queries on large tables until the rebuild finishes.
4. Wait for message “Finished indexes rebuilding for cache <cache_name>” in the
Ignite log
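
The steps above can be sketched as a small script, to be run only while the node is stopped. It assumes the index.bin path layout from step 2 is relative to a storage root passed in by the caller; storage_root, the function names and the dry_run flag are illustrative, not part of Ignite. The log check only looks for the message text from step 4 plus the cache name on the same line, because the exact log format may differ between versions.

```python
from pathlib import Path
from typing import Iterable, List, Set

# Message text from step 4; the rest of the log line format is not assumed.
REBUILD_DONE = "Finished indexes rebuilding for cache"


def index_file(storage_root: Path, consistent_id: str, cache_name: str) -> Path:
    """Build the index.bin path using the layout from step 2."""
    return storage_root / "db" / consistent_id / f"cache-{cache_name}" / "index.bin"


def delete_index_files(
    storage_root, consistent_id: str, cache_names: Iterable[str], dry_run: bool = True
) -> List[Path]:
    """Return the index.bin files that exist; delete them unless dry_run is set."""
    found = []
    for name in cache_names:
        path = index_file(Path(storage_root), consistent_id, name)
        if path.is_file():
            found.append(path)
            if not dry_run:
                path.unlink()
    return found


def caches_still_rebuilding(log_lines: Iterable[str], cache_names: Iterable[str]) -> Set[str]:
    """Return caches with no 'finished rebuilding' line in the log yet.

    Uses a plain substring match on the cache name, so a cache whose name is
    a prefix of another cache's name can be matched by mistake.
    """
    pending = set(cache_names)
    for line in log_lines:
        if REBUILD_DONE in line:
            pending -= {name for name in pending if name in line}
    return pending
```

Running delete_index_files with the default dry_run=True first shows which files would be removed before anything is deleted.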

# Recovering from a persistent storage corruption
## Applicable if
A part of the persistent storage (partition files, checkpoint markers or WAL) was corrupted
and there is no other way to recover it, but there are healthy copies of all data on other
nodes.

## Steps to recover
1. Stop the node
2. Delete all persistence files of the node (best to clear Ignite working directory, storage
directory, WAL and WAL archive directories)
3. Make sure consistentId is explicitly set in the configuration of the node
- If it isn’t, look up the generated consistentId using control.sh and set it explicitly
in the config or via IGNITE_CONSISTENT_ID (2.8+ only)
4. Start the node
5. Wait for messages <Finished rebalancing cache> for all caches
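
Steps 2 and 5 above can be sketched in the same way. The directories to clear are supplied by the caller, since their locations depend on the node configuration; the function names and the dry_run flag are made up for illustration. The marker text is copied from step 5 and kept as a parameter, because the actual log wording may differ between versions. This must only be run on a stopped node whose data has healthy copies elsewhere.

```python
import shutil
from pathlib import Path
from typing import Iterable, List, Set

# Message text copied from step 5; treat the exact wording as an assumption.
REBALANCE_DONE = "Finished rebalancing cache"


def clear_persistence(dirs: Iterable, dry_run: bool = True) -> List[Path]:
    """Empty each existing directory; return the entries removed (or to be removed)."""
    removed = []
    for directory in map(Path, dirs):
        if not directory.is_dir():
            continue  # already gone, e.g. nested in a directory cleared earlier
        for entry in sorted(directory.iterdir()):
            removed.append(entry)
            if dry_run:
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return removed


def caches_still_rebalancing(
    log_lines: Iterable[str], cache_names: Iterable[str], marker: str = REBALANCE_DONE
) -> Set[str]:
    """Return caches with no line containing both the marker and the cache name."""
    pending = set(cache_names)
    for line in log_lines:
        if marker in line:
            pending -= {name for name in pending if name in line}
    return pending
```

The directories themselves are kept and only their contents are removed, so the node can start with the same configured paths.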


We could have more fine-grained ways to handle data corruption once we address the issues from
the “Starting with missing PDS pieces” thread, create a WAL and/or partition file recovery
tool, allow WAL records for a missing cache (say, when we deleted the corrupted files of a
single cache), etc.

Stan

From: Denis Magda
Sent: February 7, 2019 3:12
To: dev; Stanislav Lukyanov
Subject: Re: Ignite index corruption issue -> unrecoverable cluster

Stan,

Thanks for starting the "Starting with missing PDS pieces" thread, which promises to
embed usability changes into the source code. In the meantime, could you
propose a TODO list for recovering from index corruption and similar
scenarios? I know that you're experienced in that and it will be great to
document the procedures until the code is modified.

-
Denis


On Wed, Jan 30, 2019 at 1:02 PM Denis Magda <dmagda@apache.org> wrote:

> Dmitry,
>
> Thanks, the FAQ section might make sense but, as the practice shows, it's
> hard to get recommendations even for questions like this one :)
>
> Ignite experts, please chime in, the project fails with data corruption
> periodically and we have to explain how to work around it until an issue is
> resolved.
>
> -
> Denis
>
>
> On Wed, Jan 30, 2019 at 11:55 AM Dmitriy Pavlov <dpavlov@apache.org>
> wrote:
>
>> Denis,
>>
>> BTW one case of corruption is fixed here,
>> https://issues.apache.org/jira/browse/IGNITE-11030
>>
>> I still need a review from Ignite Native Persistence Experts. I feel it is
>> really important to apply such fixes.
>>
>> Sincerely,
>> Dmitriy Pavlov
>>
>> On Thu, Jan 24, 2019 at 16:29, Dmitriy Pavlov <dpavlov@apache.org> wrote:
>>
>> > Denis, what do you think about a more general idea of creating FAQs for
>> > Ignite users?
>> >
>> > What if experts will once place their answer in a wiki page and then
>> > develop answers for frequent problems.
>> >
>> > And before diving into researching each problem, experienced community
>> > members will ask users to check the FAQ first?
>> >
>> > Sincerely,
>> > Dmitriy Pavlov
>> >
>> > P.S. Here is an article that the Apache guides refer to:
>> > http://www.catb.org/~esr/faqs/smart-questions.html - one of the actions
>> > required from users is to search for information.
>> >
>> > On Thu, Jan 24, 2019 at 01:55, Denis Magda <dmagda@gridgain.com> wrote:
>> >
>> >> Another data/index corruption issue:
>> >>
>> >>
>> https://stackoverflow.com/questions/54295401/ignite-transaction-failure-not-recoverable-with-persistance
>> >>
>> >> It's suggested to clean index.bin to be able to recover the cluster.
>> >> Folks,
>> >> let's prepare a list of actions to do if a cluster becomes
>> unrecoverable
>> >> due to data or index corruption issue. What should we do depending on
>> an
>> >> exception:
>> >>
>> >>    - Remove index.bin if X or Y or Z
>> >>    - etc
>> >>
>> >>
>> >> --
>> >> Denis Magda
>> >>
>> >>
>> >> On Sun, Dec 30, 2018 at 10:06 AM Denis Magda <dmagda@gridgain.com>
>> wrote:
>> >>
>> >> > Ignite SQL and memory experts,
>> >> >
>> >> > The following issue was reported on SO:
>> >> >
>> >> >
>> >>
>> https://stackoverflow.com/questions/53979106/ignite-corruptedtreeexception-leads-to-cluster-failure
>> >> >
>> >> > The stack trace starts with the message below, more details are in
>> that
>> >> > forum:
>> >> >
>> >> > [SEVERE][data-streamer-stripe-2-#15][GridDhtAtomicCache] <MyCache>
>> >> > Unexpected exception during cache update
>> >> > org.h2.message.DbException: General error: "class
>> >> >
>> >>
>> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>> >> > Runtime failure on row: Row@75ab6623[ key: CacheKey
>> [idHash=242632156,
>> >> > hash=-841684964, parentId=-8607237606486310912, hour=9,
>> >> > id=-8607237528489033728, date=2018-09-09 00:00:00.0], val: CacheValue
>> >> > [idHash=843227122, hash=-801894604, ....
>> >> >
>> >> > Let's see if it's addressed in the latest release. Also, the user
>> asked
>> >> a
>> >> > reasonable question - how to recover? Yes, it's possible to use
>> >> snapshots
>> >> > of GridGain if they are created before but I remember some
>> discussions
>> >> > around a recovery tool.
>> >> >
>> >> > --
>> >> > Denis
>> >> >
>> >>
>> >
>>
>

