ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: GridDhtInvalidPartitionException takes the cluster down
Date Thu, 31 Oct 2019 02:13:27 GMT

Let me restart this conversation. I regularly come across discussions
where users ask how to deal with exceptions generated by the
failure handlers. Here is a fresh one for our reference:

Even though the exception in that SO thread is not the root cause, it
confused the Ignite users, who have no idea how to treat it or how to proceed.

Let's either adjust the message format or show this message only if DEBUG
mode is enabled. Thoughts? Please check my previous response, where some
thoughts about the new format were shared.
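
As a rough illustration of the format outlined in the earlier response quoted
below (what happened, why it happened, troubleshooting guidance), here is a
minimal Java sketch. The class FailureMessageSketch and its method are
hypothetical and not part of Ignite. It also assumes that heartbeatTs is a
timestamp in epoch milliseconds, which the log excerpt does not state
explicitly.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical formatter for illustration only; not an Apache Ignite API. */
final class FailureMessageSketch {
    /**
     * Builds a three-part message for a blocked system worker.
     *
     * @param workerName  worker name as printed in the log
     * @param heartbeatTs last heartbeat, ASSUMED to be epoch milliseconds
     * @param nowTs       detection time in epoch milliseconds
     */
    static String describeBlockedWorker(String workerName, long heartbeatTs, long nowTs) {
        // How long the worker has been silent.
        Duration silent = Duration.ofMillis(nowTs - heartbeatTs);

        return "What happened: system worker '" + workerName + "' is blocked.\n"
            + "Why: no heartbeat for " + silent.getSeconds() + " s (last heartbeat at "
            + Instant.ofEpochMilli(heartbeatTs) + ").\n"
            + "What to do: check GC logs for long pauses and ensure that compute tasks "
            + "are not oversaturating the CPUs.";
    }

    public static void main(String[] args) {
        // heartbeatTs is taken from the quoted log excerpt; the 30 s gap is made up.
        long heartbeatTs = 1553481506244L;
        System.out.println(
            describeBlockedWorker("grid-timeout-worker", heartbeatTs, heartbeatTs + 30_000L));
    }
}
```

The point of the sketch is only that the raw fields (worker name, heartbeatTs)
can be turned into a sentence that a user can act on without asking on the
user list. The actual wording and the set of hints would have to come from
the failure handler itself.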


On Mon, Apr 15, 2019 at 7:27 PM Denis Magda <dmagda@apache.org> wrote:

> Alright, it took me longer than expected to get back and look into it.
> Sorry for the delay. Overall, folks, things look creepy, seriously. I see
> 3 primary issues ranked by priority.
> 1st, until the failure handler gets smart enough to avoid
> false positives, it should print out a warning message instead of stopping a
> node. *Andrey*, that's the new behavior of the 2.7.5 release according to JIRA,
> right?
> 2nd, the format of the warning/exception message gives neither hints
> for troubleshooting nor a clue as to why this happened. I have no idea what to
> suggest to those who see exceptions of this kind [1] and have to call for
> help from Andrey and other committers. For instance, taking [1] as a
> reference:
> Critical system error detected. Will be handled accordingly to configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler
> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException:
> GridWorker [name=grid-timeout-worker, igniteInstanceName=TravelInventoryTesting,
> finished=false, heartbeatTs=1553481506244]]] class
> org.apache.ignite.IgniteException: GridWorker [name=grid-timeout-worker,
> igniteInstanceName=TravelInventoryTesting,
> finished=false, heartbeatTs=1553481506244]
> A lot of the details might be hidden but, unfortunately, the
> interpretation of parameters like heartbeatTs, tryStop, finished, timeout,
> etc. is hard. It seems like a message that has to be fed into a complementary
> tool to get an answer. The format of the message has to help
> the user (a developer/devops/administrator/architect who has zero affiliation
> with the Ignite community) troubleshoot without calling for help on
> the user list:
>    - What happened - out of memory/critical error/hanging threads. We're
>    already pretty good at that.
>    - Why this happened - supply context in human language. For instance,
>    "discovery thread was not responding within N seconds because of starvation
>    or long GC pause."
>    - Troubleshooting guidance - help the user work around the issue.
>    For instance, "Check your GC logs, ensure that compute tasks are not
>    oversaturating CPUs causing livelocks. Tune parameter Y and Z."
> Do you see anything else? Let's design and enhance.
> 3rd, full cluster shutdown. Agreed, that's harder. Do we have stats on when it
> usually happens?
> [1]
> http://apache-ignite-users.70518.x6.nabble.com/Replace-or-Put-after-PutAsync-causes-Ignite-to-hang-td27871.html#a27873
> -
> Denis
