ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: GridDhtInvalidPartitionException takes the cluster down
Date Mon, 15 Apr 2019 23:27:38 GMT
Alright, it took me longer than expected to get back and look into it. Sorry
for the delay. Overall, folks, things look creepy, seriously. I see 3 primary
issues, ranked by priority.

1st, until the failure handler gets smart enough to know how to deal with
exceptions like this one, it should avoid false positives and print out a
warning message instead of stopping a node. *Andrey*, that's the new behavior
of the 2.7.5 release according to JIRA, right?
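
To make the suggestion concrete, here is a minimal sketch of what a lenient
mode could look like, built on the public FailureHandler SPI. This is my
assumption of the intended behavior, not the actual 2.7.5 implementation:

import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;

// Hypothetical "warn only" handler: log the failure context and return
// false so that Ignite does not invalidate (stop) the node.
public class WarnOnlyFailureHandler implements FailureHandler {
    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        ignite.log().warning("Critical failure observed, keeping the node alive: "
            + failureCtx);
        return false;
    }
}

// Wired up through the node configuration:
// cfg.setFailureHandler(new WarnOnlyFailureHandler());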

2nd, the format of the warning/exception message doesn't give any hints for
troubleshooting, nor a clue why this happened. I have no idea what to
suggest to those who see exceptions of this kind [1] and have to call for
help from Andrey and other committers. For instance, taking [1] as an
example:
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler [*tryStop*=false, *timeout*=0],
failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
org.apache.ignite.IgniteException: GridWorker [name=grid-timeout-worker,
finished=false, *heartbeatTs*=1553481506244]]]

class org.apache.ignite.IgniteException: GridWorker
[name=grid-timeout-worker, finished=false, *heartbeatTs*=1553481506244]
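
For what it's worth, my best-effort reading of tryStop and timeout (an
assumption based on the public API, which only proves the point that the
message itself should explain them): these are the constructor arguments of
StopNodeOrHaltFailureHandler, i.e. whether to attempt a graceful node stop
first and how long to wait before halting the JVM:

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

// tryStop = try a graceful node stop first; timeout = how long to wait
// for it before falling back to halting the JVM. These are exactly the
// [tryStop=false, timeout=0] values printed in the message above.
IgniteConfiguration cfg = new IgniteConfiguration()
    .setFailureHandler(new StopNodeOrHaltFailureHandler(true, 10_000));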

A lot of the details might be hidden but, unfortunately, the interpretation
of parameters like heartbeatTs, tryStop, finished, timeout, etc. is hard. It
reads like a message that has to be fed into a complementary tool which will
give me an answer. Instead, the format of the message has to help the user
(a developer/devops/administrator/architect who has zero affiliation with
the Ignite community) troubleshoot without calling for help on the user
list:

   - What happened - out of memory/critical error/hanging threads. We're
   already pretty good at that.
   - Why this happened - supply the context in human language. For instance,
   "the discovery thread was not responding within N seconds because of
   starvation or a long GC pause."
   - Troubleshooting guidance - help the user work around the issue. For
   instance, "Check your GC logs, ensure that compute tasks are not
   oversaturating CPUs causing livelocks. Tune parameters Y and Z." (A rough
   mock-up of such a message follows this list.)
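
Putting the three parts together, a purely hypothetical rendering (wording
and numbers invented for illustration) could look like:

Critical system error detected: SYSTEM_WORKER_BLOCKED
What happened:  system worker 'grid-timeout-worker' has not updated its
                heartbeat for 12,500 ms (threshold: 10,000 ms).
Probable cause: thread starvation or a long GC pause.
What to check:  inspect GC logs; make sure compute tasks are not
                oversaturating CPUs causing livelocks; consider tuning
                IgniteConfiguration.systemWorkerBlockedTimeout.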

Do you see anything else? Let's design and enhance.

3rd, the full cluster shutdown. Agreed, that's harder. Do we have stats on
when it usually happens?


