ignite-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Re: Node stopped.
Date Tue, 29 Oct 2019 15:13:52 GMT
Hello!

Well, this is the relevant part:
[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
*Possible
too long JVM pause: 55705 milliseconds*.
[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
Client node considered as unreachable and will be dropped from cluster,
because no metrics update messages received in interval:
TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
network problems or long GC pause on client node, try to increase this
parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
clientFailureDetectionTimeout=30000]
[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
Client node considered as unreachable and will be dropped from cluster,
because no metrics update messages received in interval:
TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
network problems or long GC pause on client node, try to increase this
parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
clientFailureDetectionTimeout=30000]
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked
system-critical thread has been detected. This can lead to cluster-wide
undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
[22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi]
Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827,
rmtPort=51827
[22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread
[name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING,
blockCnt=6, waitCnt=486312]
    Lock
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d,
ownerName=null, ownerId=-1]

Long GC will cause node to segment.

Try either decreasing heap size / making sure full GC does not happen, or
increase failureDetectionTimeout (clientFailureDetectionTimeout if set) on
all nodes.

Regards,
-- 
Ilya Kasnacheev


вт, 29 окт. 2019 г. в 18:02, John Smith <java.dev.mtl@gmail.com>:

> Hi, in my original post I had included full logs of the stopped node...
>
> And when you say set to localhost you which property do you mean is it the
> one IgniteConfiguration setLocalhost()?
>
> On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
> wrote:
>
>> Hello!
>>
>> I think it is possible (such as, node tries to connect to wrong address /
>> itself) but more detailed analysis of logs is needed.
>>
>> You can try specifying localHost property in IgniteConfiguration to make
>> sure correct address is used.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> пн, 28 окт. 2019 г. в 23:16, John Smith <java.dev.mtl@gmail.com>:
>>
>>> Ok cool. Thanks. I'm just trying to figure scenarios where it may happen.
>>>
>>> When I connect with ignite visor and run the "top" topology command. The
>>> client node is bound to multiple addresses about 12...
>>>
>>> Ignite is running on plain regular VMs while the client node is running
>>> in a containerized env but using host network (which is regular VM).
>>>
>>> So wondering if that would also screw it up something, by trying to
>>> connect to one of those multiple addresses?
>>>
>>> On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> Sure, if nodes can't reach each other, eventually they may segment and
>>>> stop.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> пт, 25 окт. 2019 г. в 00:08, John Smith <java.dev.mtl@gmail.com>:
>>>>
>>>>> Is it possible this is somehow causing the issue of the node stopping?
>>>>>
>>>>> On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <
>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> This likely means that you have reachability problems in your
>>>>>> cluster, such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on
>>>>>> range 47100-47200) but not the other way around.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>>
>>>>>> пн, 21 окт. 2019 г. в 19:36, John Smith <java.dev.mtl@gmail.com>:
>>>>>>
>>>>>>> I also see this printing every few seconds on my client
>>>>>>> application...
>>>>>>> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted
>>>>>>> incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101,
>>>>>>> rmtAddr=/xxx.xxx.xxx.82:49816
>>>>>>>
>>>>>>> On Mon, 21 Oct 2019 at 12:04, John Smith <java.dev.mtl@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, thanks. I already made sure that each ignite VM runs
on a
>>>>>>>> separate host within our cloud. I'm not doing any of that
migration stuff.
>>>>>>>>
>>>>>>>> Also recently disabled metrics igniteConfig.setMetricsLogFrequency(
>>>>>>>> 0); Just to make sure it doesn't get too chatty. But i doubt
this
>>>>>>>> would affect it...
>>>>>>>>
>>>>>>>> Should I maybe set the node timeout a bit higher then 30
seconds
>>>>>>>> maybe put it to 60 seconds? I remember people suggesting
this but not
>>>>>>>> sure...
>>>>>>>>
>>>>>>>> On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <
>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The following documentation page has some useful points
on
>>>>>>>>> deployment in a virtualised environment:
>>>>>>>>> https://apacheignite.readme.io/docs/vmware-deployment
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>> On 17 Oct 2019, 17:41 +0300, John Smith <java.dev.mtl@gmail.com>,
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Ok I have metribeat running on the VM hopefully I will
see
>>>>>>>>> something...
>>>>>>>>>
>>>>>>>>> On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <
>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> There are no long pauses in the GC logs, so it must
be the whole
>>>>>>>>>> VM pause.
>>>>>>>>>>
>>>>>>>>>> Denis
>>>>>>>>>> On 16 Oct 2019, 23:07 +0300, John Smith <java.dev.mtl@gmail.com>,
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Sorry here is the gc logs for all 3 machines:
>>>>>>>>>> https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0
>>>>>>>>>>
>>>>>>>>>> On Wed, 16 Oct 2019 at 15:49, John Smith <java.dev.mtl@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, so it happened again here is my latest gc.log
stats:
>>>>>>>>>>> https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e
>>>>>>>>>>>
>>>>>>>>>>> Everything seems ok to me. I also have Elasticsearch
Metricbeat
>>>>>>>>>>> running, the CPU usage looked normal at the time.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov
<
>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, I don’t.
>>>>>>>>>>>> You can ask the VM vendor or the cloud provider
(if you use
>>>>>>>>>>>> any) for a proper tooling or logs.
>>>>>>>>>>>> Make sure, that there is no such step in
the VM’s lifecycle
>>>>>>>>>>>> that makes it freeze for a minute.
>>>>>>>>>>>> Also make sure that the physical CPU is not
overutilized and no
>>>>>>>>>>>> VMs that run on it are starving.
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>> On 10 Oct 2019, 19:03 +0300, John Smith <java.dev.mtl@gmail.com>,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Do you know of any good tools I can use to
check the VM?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov
<
>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > Hi Dennis, so are you saying I should
enable GC logs + the
>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Having safepoint statistics in your GC
logs may be useful, so
>>>>>>>>>>>>> I recommend enabling them for troubleshooting
purposes.
>>>>>>>>>>>>> Check the lifecycle of your virtual machines.
There is a high
>>>>>>>>>>>>> chance that the whole machine is frozen,
not just the Ignite node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>> On 10 Oct 2019, 18:25 +0300, John Smith
<
>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Dennis, so are you saying I should
enable GC logs + the
>>>>>>>>>>>>> safe point logs as well?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 11:22, John Smith
<
>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You are correct, it is running in
a VM.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 10 Oct 2019 at 10:11, Denis
Mekhanikov <
>>>>>>>>>>>>>> dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are the following messages
in the logs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>>>>>>>>>>>>>> Possible too long JVM pause:
*55705 milliseconds*.
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>> Blocked system-critical thread
has been detected. This can lead to
>>>>>>>>>>>>>>> cluster-wide undefined behaviour
[threadName=partition-exchanger,
>>>>>>>>>>>>>>> blockedFor=*57s*]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looks like the JVM was paused
for almost a minute. It
>>>>>>>>>>>>>>> doesn’t seem to be caused by
a garbage collection, since there is no
>>>>>>>>>>>>>>> evidence of GC pressure in the
GC log. Usually such big pauses happen in
>>>>>>>>>>>>>>> virtualised environments when
backups are captured from machines or they
>>>>>>>>>>>>>>> just don’t have enough CPU
time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking at safepoint statistics
may also reveal some
>>>>>>>>>>>>>>> interesting details. You can
learn about safepoints here:
>>>>>>>>>>>>>>> https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>> On 9 Oct 2019, 23:14 +0300, John
Smith <
>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>, wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So the error sais to set clientFailureDetectionTimeout=30000
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1- Do I put a higher value than
30000?
>>>>>>>>>>>>>>> 2- Do I do it on the client or
the server nodes or all nodes?
>>>>>>>>>>>>>>> 3- Also if a client is misbehaving
why shutoff the server
>>>>>>>>>>>>>>> node?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 3 Oct 2019 at 21:02,
John Smith <
>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But if it's the client node
that's failing why is the
>>>>>>>>>>>>>>>> server node stopping? I'm
pretty sure we do verry simple put and get
>>>>>>>>>>>>>>>> operations. All the client
nodes are started as client=true
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu., Oct. 3, 2019, 4:18
p.m. Denis Magda, <
>>>>>>>>>>>>>>>> dmagda@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't see any GC pressure
or STW pauses either. If not
>>>>>>>>>>>>>>>>> GC then it might have
been caused by a network glitch or some long-running
>>>>>>>>>>>>>>>>> operation started by
the app. These logs statement
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>> Client node considered
as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>> because no metrics update
messages received in interval:
>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout()
ms. It may be caused by
>>>>>>>>>>>>>>>>> network problems or long
GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>> parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef,
>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi]
>>>>>>>>>>>>>>>>> Client node considered
as unreachable and will be dropped from cluster,
>>>>>>>>>>>>>>>>> because no metrics update
messages received in interval:
>>>>>>>>>>>>>>>>> TcpDiscoverySpi.clientFailureDetectionTimeout()
ms. It may be caused by
>>>>>>>>>>>>>>>>> network problems or long
GC pause on client node, try to increase this
>>>>>>>>>>>>>>>>> parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d,
>>>>>>>>>>>>>>>>> clientFailureDetectionTimeout=30000]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G]
>>>>>>>>>>>>>>>>> Blocked system-critical
thread has been detected. This can lead to
>>>>>>>>>>>>>>>>> cluster-wide undefined
behaviour [threadName=partition-exchanger,
>>>>>>>>>>>>>>>>> blockedFor=57s]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][]
>>>>>>>>>>>>>>>>> Critical system error
detected. Will be handled accordingly to configured
>>>>>>>>>>>>>>>>> handler [hnd=StopNodeOrHaltFailureHandler
[tryStop=false, timeout=0,
>>>>>>>>>>>>>>>>> super=AbstractFailureHandler
>>>>>>>>>>>>>>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]],
failureCtx=FailureContext
>>>>>>>>>>>>>>>>> [type=SYSTEM_WORKER_BLOCKED,
err=class o.a.i.IgniteException: GridWorker
>>>>>>>>>>>>>>>>> [name=partition-exchanger,
igniteInstanceName=xxxxxx, finished=false,
>>>>>>>>>>>>>>>>> heartbeatTs=1568931981805]]]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at
11:50 AM John Smith <
>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com>
wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So I have been monitoring
my node and the same one seems
>>>>>>>>>>>>>>>>>> to stop once in a
while.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/7n5qfsl5uyi1obt/ignite-logs.zip?dl=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have attached the
GC logs and the ignite logs. From
>>>>>>>>>>>>>>>>>> what I see from gc.logs
I don't see big pauses. I could be wrong.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The machine is 16GB
and I have the configs here:
>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here are the JVM
settings...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> if [ -z "$JVM_OPTS"
] ; then
>>>>>>>>>>>>>>>>>>     JVM_OPTS="-Xms2g
-Xmx2g -server
>>>>>>>>>>>>>>>>>> -XX:MaxMetaspaceSize=256m"
>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> JVM_OPTS="$JVM_OPTS
-XX:+UseG1GC -verbose:gc
>>>>>>>>>>>>>>>>>> -XX:+PrintGCDetails
-Xloggc:/var/log/apache-ignite/gc.log"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> JVM_OPTS="${JVM_OPTS}
-Xss16m"
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>

Mime
View raw message