ignite-user mailing list archives

From John Smith <java.dev....@gmail.com>
Subject Re: Node stopped.
Date Mon, 28 Oct 2019 20:15:35 GMT
Ok cool, thanks. I'm just trying to figure out the scenarios where it may happen.

When I connect with Ignite Visor and run the "top" topology command, the
client node shows as bound to multiple addresses, about 12...

Ignite itself runs on plain regular VMs, while the client node runs in a
containerized environment but uses the host network (which is a regular VM).

So I'm wondering whether that could also break something, by making the
servers try to connect to one of those multiple addresses?
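
One thing I'm thinking of trying is pinning the client to the single host-network address so it doesn't advertise all 12. Roughly something like this (the address and class name below are made up, just to illustrate):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientStart {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true);
        // Bind and advertise only the VM's host-network address instead of
        // every interface the container can see (10.0.0.5 is a placeholder).
        cfg.setLocalHost("10.0.0.5");
        Ignition.start(cfg);
    }
}

Would that at least take the extra addresses out of the picture?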

On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
wrote:

> Hello!
>
> Sure, if nodes can't reach each other, eventually they may segment and
> stop.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> On Fri, 25 Oct 2019 at 00:08, John Smith <java.dev.mtl@gmail.com> wrote:
>
>> Is it possible this is somehow causing the issue of the node stopping?
>>
>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> This likely means that you have reachability problems in your cluster, for example xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on the port range 47100-47200) but not the other way around.
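>>>
>>> A quick way to confirm this from each side (just a sketch, using the same masked addresses as above) is to probe the communication port range with nc, e.g.:
>>>
>>> # run this on .68, then the equivalent command from .82 back towards .68
>>> nc -vz xxx.xxx.xxx.82 47100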
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> On Mon, 21 Oct 2019 at 19:36, John Smith <java.dev.mtl@gmail.com> wrote:
>>>
>>>> I also see this being printed every few seconds by my client application...
>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816
>>>>
>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev.mtl@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, thanks. I already made sure that each Ignite VM runs on a separate host within our cloud. I'm not doing any of that migration stuff.
>>>>>
>>>>> Also, I recently disabled the metrics log with igniteConfig.setMetricsLogFrequency(0), just to make sure it doesn't get too chatty. But I doubt this would affect it...
>>>>>
>>>>> Should I maybe set the node failure timeout a bit higher than 30 seconds, say to 60 seconds? I remember people suggesting this, but I'm not sure...
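>>>>>
>>>>> Something like this is what I have in mind (60 seconds is just a guess on my part):
>>>>>
>>>>> IgniteConfiguration cfg = new IgniteConfiguration();
>>>>> // Give clients 60s instead of the default 30s before they are dropped.
>>>>> cfg.setClientFailureDetectionTimeout(60_000);
>>>>> // Keep the metrics log off so it stays quiet.
>>>>> cfg.setMetricsLogFrequency(0);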
>>>>>
>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <dmekhanikov@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The following documentation page has some useful points on deployment
>>>>>> in a virtualised environment:
>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>
>>>>>> Denis
>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev.mtl@gmail.com>,
>>>>>> wrote:
>>>>>>
>>>>>> Ok, I have Metricbeat running on the VM; hopefully I will see something...
>>>>>>
>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <dmekhanikov@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> There are no long pauses in the GC logs, so it must be a pause of the whole VM.
>>>>>>>
>>>>>>> Denis
>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev.mtl@gmail.com>,
>>>>>>> wrote:
>>>>>>>
>>>>>>> Sorry, here are the GC logs for all 3 machines:
>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>
>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev.mtl@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, so it happened again. Here are my latest gc.log stats:
>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>
>>>>>>>> Everything seems OK to me. I also have Elasticsearch Metricbeat running, and the CPU usage looked normal at the time.
>>>>>>>>
>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <
>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>> You can ask the VM vendor or the cloud provider (if you use one) for proper tooling or logs.
>>>>>>>>> Make sure that there is no step in the VM's lifecycle that makes it freeze for a minute.
>>>>>>>>> Also make sure that the physical CPU is not over-utilized and that none of the VMs running on it are starving.
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev.mtl@gmail.com>,
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Do you know of any good tools I can use to check the VM?
>>>>>>>>>
>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <
>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> > Hi Dennis, so are you saying I should enable GC logs + the safepoint logs as well?
>>>>>>>>>>
>>>>>>>>>> Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
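>>>>>>>>>>
>>>>>>>>>> For example, on JDK 8 (which I assume you are on, judging by the -XX:+PrintGCDetails flag you posted) you could extend your JVM_OPTS with something like:
>>>>>>>>>>
>>>>>>>>>> # The stopped-time lines should end up in your GC log; the safepoint
>>>>>>>>>> # statistics themselves print to the JVM's standard output.
>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
>>>>>>>>>>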
>>>>>>>>>> Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
>>>>>>>>>>
>>>>>>>>>> Denis
>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith <java.dev.mtl@gmail.com>,
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Dennis, so are you saying I should enable GC logs + the safepoint logs as well?
>>>>>>>>>>
>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith <java.dev.mtl@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> You are correct, it is running in a VM.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi!
>>>>>>>>>>>>
>>>>>>>>>>>> There are the following messages in the logs:
>>>>>>>>>>>>
>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>> Possible too long JVM pause: *55705 milliseconds*.
>>>>>>>>>>>> ...
>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=*57s*]
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like the JVM was paused for almost a minute. It doesn't seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log.
>>>>>>>>>>>> Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don't have enough CPU time.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here:
>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John Smith <java.dev.mtl@gmail.com>,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> So the error says to set clientFailureDetectionTimeout=30000.
>>>>>>>>>>>>
>>>>>>>>>>>> 1- Do I put a higher value than 30000?
>>>>>>>>>>>> 2- Do I do it on the client nodes, the server nodes, or all nodes?
>>>>>>>>>>>> 3- Also, if a client is misbehaving, why shut off the server node?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02, John Smith <java.dev.mtl@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we only do very simple put and get operations.
>>>>>>>>>>>>> All the client nodes are started with client=true.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda <dmagda@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 11:50 AM John Smith <java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So I have been monitoring my node and the same one seems to stop once in a while.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have attached the GC logs and the Ignite logs. From what I can see in the GC logs, there are no big pauses. I could be wrong.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The machine has 16GB of RAM, and I have the configs here:
>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here are the JVM settings...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS} -Xss16m"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
