ignite-user mailing list archives

From Denis Mekhanikov <dmekhani...@gmail.com>
Subject Re: Node stopped.
Date Fri, 18 Oct 2019 10:08:37 GMT
The following documentation page has some useful points on deployment in a virtualised environment: https://apacheignite.readme.io/docs/vmware-deployment

Denis
On 17 Oct 2019, 17:41 +0300, John Smith <java.dev.mtl@gmail.com>, wrote:
> Ok, I have Metricbeat running on the VM, hopefully I will see something...
>
> > On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > There are no long pauses in the GC logs, so it must be a pause of the whole VM.
> > >
> > > Denis
> > > On 16 Oct 2019, 23:07 +0300, John Smith <java.dev.mtl@gmail.com>, wrote:
> > > > Sorry, here are the GC logs for all 3 machines: https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
> > > >
> > > > > On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev.mtl@gmail.com> wrote:
> > > > > > Hi, so it happened again. Here are my latest gc.log stats: https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
> > > > > >
> > > > > > Everything seems ok to me. I also have Elasticsearch Metricbeat running, and the CPU usage looked normal at the time.
> > > > > >
> > > > > > > On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > > > > > > Unfortunately, I don’t.
> > > > > > > > You can ask the VM vendor or the cloud provider (if you use any) for proper tooling or logs.
> > > > > > > > Make sure that there is no step in the VM’s lifecycle that makes it freeze for a minute.
> > > > > > > > Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.
> > > > > > > >
> > > > > > > > Denis
> > > > > > > > On 10 Oct 2019, 19:03 +0300, John Smith <java.dev.mtl@gmail.com>, wrote:
> > > > > > > > > Do you know of any good tools I can use to check the VM?
> > > > > > > > >
> > > > > > > > > > On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > > > > > > > > > > Hi Denis, so are you saying I should enable GC logs + the safepoint logs as well?
> > > > > > > > > > >
> > > > > > > > > > > Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
> > > > > > > > > > > Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
> > > > > > > > > > >
> > > > > > > > > > > Denis
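The safepoint logging recommended above is enabled with a couple of JVM flags. A minimal sketch, assuming a JDK 8 JVM (the -XX:+PrintGCDetails and -Xloggc options quoted at the bottom of the thread suggest one; on JDK 9+ the unified -Xlog:safepoint option replaces these), appended to the JVM_OPTS already in use:

# Sketch only: JDK 8 flags that record pause and safepoint details for troubleshooting.
# PrintGCApplicationStoppedTime writes every stop-the-world pause (GC or not) to the GC log;
# PrintSafepointStatistics prints, for each safepoint, which VM operation triggered it.
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime"
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"

Note that the safepoint statistics are written to stdout rather than to the -Xloggc file, so they show up in the console log instead of gc.log.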
> > > > > > > > > > > On 10 Oct 2019, 18:25 +0300, John Smith <java.dev.mtl@gmail.com>, wrote:
> > > > > > > > > > > > Hi Denis, so are you saying I should enable GC logs + the safepoint logs as well?
> > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 10 Oct 2019 at 11:22, John Smith <java.dev.mtl@gmail.com> wrote:
> > > > > > > > > > > > > > You are correct, it is running in a VM.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > > > > > > > > > > > > > > Hi!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > There are the following messages in the logs:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
> > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Looks like the JVM was paused for almost a minute. It doesn’t seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don’t have enough CPU time.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > > > On 9 Oct 2019, 23:14 +0300, John Smith <java.dev.mtl@gmail.com>, wrote:
> > > > > > > > > > > > > > > > > So the error says to set clientFailureDetectionTimeout=30000.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1- Do I put a higher value than 30000?
> > > > > > > > > > > > > > > > > 2- Do I do it on the client or the server nodes or all nodes?
> > > > > > > > > > > > > > > > > 3- Also, if a client is misbehaving, why shut off the server node?
> > > > > > > > > > > > > > > > >
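For reference on where that parameter lives: clientFailureDetectionTimeout is a property of IgniteConfiguration, so raising it goes into the Spring XML file (such as the ignite-config.xml linked further down in the thread). A minimal sketch; the value 60000 is purely illustrative, not a recommendation from the thread, and question 2 above (which nodes need the setting) is not answered here:

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Interval (ms) without metrics updates from a client node after which it is
         dropped, per the TcpDiscoverySpi warnings quoted below; the default is 30000. -->
    <property name="clientFailureDetectionTimeout" value="60000"/>
</bean>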
> > > > > > > > > > > > > > > > > > On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev.mtl@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started as client=true.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <dmagda@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > Hi John,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > -
> > > > > > > > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > > > > > > > >
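As context for the last SEVERE line above: what the node does after a critical failure is decided by the failure handler named in that log entry. The handler can also be declared explicitly in the Ignite Spring XML; the sketch below merely mirrors the StopNodeOrHaltFailureHandler the log reports as active (tryStop=false, timeout=0) and is not a suggestion from the thread to change it:

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureHandler">
        <!-- The handler the log reports: stops or halts the node on critical failures. -->
        <bean class="org.apache.ignite.failure.StopNodeOrHaltFailureHandler">
            <constructor-arg value="false"/> <!-- tryStop -->
            <constructor-arg value="0"/>     <!-- timeout, ms -->
        </bean>
    </property>
</bean>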
> > > > > > > > > > > > > > > > > > > > > > On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev.mtl@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > So I have been monitoring my node and the same one seems to stop once in a while.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I have attached the GC logs and the Ignite logs. From what I see in the GC logs I don't see big pauses, but I could be wrong.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > The machine has 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Here are the JVM settings...
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > if [ -z "$JVM_OPTS" ] ; then
> > > > > > > > > > > > > > > > > > > > > > >     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
> > > > > > > > > > > > > > > > > > > > > > > fi
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > JVM_OPTS="${JVM_OPTS} -Xss16m"
