ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: GridDhtInvalidPartitionException takes the cluster down
Date Mon, 15 Apr 2019 23:27:38 GMT
Alright, it took me longer than expected to get back and look into it. Sorry
for the delay. Overall, folks, things look creepy, seriously. I see 3 primary
issues, ranked by priority.

1st, until the failure handler gets smart enough to deal with
SYSTEM_WORKER_BLOCKED/SYSTEM_CRITICAL_OPERATION_TIMEOUT events, we have to
avoid false positives and print out a warning message instead of stopping a
node. *Andrey*, that's the new behavior of the 2.7.5 release according to
JIRA, right?

2nd, the format of the warning/exception message gives neither hints for
troubleshooting nor a clue as to why this happened. I have no idea what to
suggest to those who see exceptions of this kind [1] and have to call for
help from Andrey and other committers. For instance, take [1] as a
reference:

Critical system error detected. Will be handled accordingly to
configured handler
[hnd=StopNodeOrHaltFailureHandler [*tryStop*=false, *timeout*=0,
super=AbstractFailureHandler
[*ignoredFailureTypes*=[SYSTEM_WORKER_BLOCKED]]],
failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class
o.a.i.IgniteException:
GridWorker [name=grid-timeout-worker,
igniteInstanceName=TravelInventoryTesting,
finished=false, *heartbeatTs*=1553481506244]]] class
org.apache.ignite.IgniteException: GridWorker [name=grid-timeout-worker,
igniteInstanceName=TravelInventoryTesting,
finished=false, *heartbeatTs*=1553481506244]

A lot of the details might be hidden but, unfortunately, the
interpretation of parameters like heartbeatTs, tryStop, finished, timeout,
etc. is hard. It seems like a message that has to be fed into a complementary
tool which will give me an answer. The format of the message has to help
the user (a developer/devops/administrator/architect who has zero affiliation
with the Ignite community) troubleshoot without calling for help on
the user list:

   - What happened - out of memory/critical error/hanging threads. We're
   already pretty good at that.
   - Why this happened - supply context in human language. For instance,
   "discovery thread was not responding within N seconds because of starvation
   or long GC pause."
   - Troubleshooting guidance - help the user work around the issue. For
   instance, "Check your GC logs, ensure that compute tasks are not
   oversaturating CPUs causing livelocks. Tune parameter Y and Z."
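
To make the three points above concrete, here is a minimal sketch of what
such a message could look like. It is not Ignite code: the class name, the
method, and the threshold parameter are all hypothetical, and it assumes
that heartbeatTs is a timestamp in epoch milliseconds (the value from [1]
decodes to a plausible date under that assumption).

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Ignite: renders a blocked-worker event
// as "what / why / next steps" instead of a raw field dump.
public class BlockedWorkerMessage {
    /**
     * @param workerName  worker name as printed in the log, e.g. grid-timeout-worker
     * @param heartbeatTs last heartbeat, ASSUMED to be epoch milliseconds
     * @param nowMs       current time in epoch milliseconds
     * @param thresholdMs hypothetical detection threshold in milliseconds
     */
    public static String describe(String workerName, long heartbeatTs, long nowMs, long thresholdMs) {
        Instant lastBeat = Instant.ofEpochMilli(heartbeatTs);
        Duration silent = Duration.ofMillis(nowMs - heartbeatTs);

        return "What: system worker '" + workerName + "' looks blocked.\n"
            + "Why: no heartbeat since " + lastBeat + " (" + silent.getSeconds()
            + " s ago, threshold " + (thresholdMs / 1000) + " s);"
            + " possible thread starvation or a long GC pause.\n"
            + "Next: check GC logs and make sure compute tasks are not saturating the CPUs.";
    }

    public static void main(String[] args) {
        long heartbeatTs = 1553481506244L; // value taken from the log in [1]
        // Pretend the check ran 15 seconds after the last heartbeat.
        System.out.println(describe("grid-timeout-worker", heartbeatTs, heartbeatTs + 15_000L, 10_000L));
    }
}
```

The point of the sketch is only the shape of the output: the raw timestamp
becomes a readable instant plus an elapsed time, and the message ends with
something the user can act on.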

Do you see anything else? Let's design and enhance.

3rd, full cluster shutdown. Agreed, that's harder. Do we have stats on when
it usually happens?


[1]
http://apache-ignite-users.70518.x6.nabble.com/Replace-or-Put-after-PutAsync-causes-Ignite-to-hang-td27871.html#a27873

-
Denis


