ignite-user mailing list archives

From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Re: [External]Re: Ignite cluster became unresponsive
Date Mon, 13 Jul 2020 10:59:09 GMT
Hello!

I recommend setting it somewhat lower, but longer than any of your expected
GC pauses. 30s is OK.
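
The rule of thumb above can be written down as a small check. This is only a sketch: `TimeoutCheck` and `inRelativeSync` are hypothetical names, not part of Apache Ignite, and it assumes that "it" in the advice refers to socketWriteTimeout and that "lower" means lower than the failure detection timeout. The figures come from this thread: the 10133 ms JVM pause and writeTimeout=2000 from the logs, the 480000 ms failure detection timeout, and the suggested 30 s.

```java
/** Hypothetical helper for illustration; not part of Apache Ignite. */
final class TimeoutCheck {
    private TimeoutCheck() {
    }

    /**
     * Returns true when the socket write timeout is longer than the longest
     * expected GC pause and lower than the failure detection timeout.
     * All values are in milliseconds.
     */
    static boolean inRelativeSync(long maxGcPauseMs,
                                  long socketWriteTimeoutMs,
                                  long failureDetectionTimeoutMs) {
        return socketWriteTimeoutMs > maxGcPauseMs
            && socketWriteTimeoutMs < failureDetectionTimeoutMs;
    }

    public static void main(String[] args) {
        long observedPauseMs = 10_133L;     // "Possible too long JVM pause" log line
        long failureDetectionMs = 480_000L; // value reported in the thread

        // writeTimeout=2000 from the log is shorter than the pause, so the check fails.
        System.out.println(inRelativeSync(observedPauseMs, 2_000L, failureDetectionMs));

        // 30 s is longer than the pause and lower than failure detection, so it passes.
        System.out.println(inRelativeSync(observedPauseMs, 30_000L, failureDetectionMs));
    }
}
```

If the check passes, the values would be applied through `TcpCommunicationSpi.setSocketWriteTimeout(long)` and `IgniteConfiguration.setFailureDetectionTimeout(long)`, both of which take milliseconds.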

Regards,
-- 
Ilya Kasnacheev


On Sun, 12 Jul 2020 at 14:03, Kamlesh Joshi <Kamlesh.Joshi@ril.com> wrote:

> Thanks for the findings Ilya.
>
>
>
> So shall we set the same timeout value for *socketWriteTimeout* as that
> of failure detection timeout on both client and server side?
>
>
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
> *From:* Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
> *Sent:* 10 July 2020 19:48
> *To:* user@ignite.apache.org
> *Subject:* Re: [External]Re: Ignite cluster became unresponsive
>
>
>
> The e-mail below is from an external source. Please do not open
> attachments or click links from an unknown or suspicious origin.
>
> Hello!
>
>
>
> It seems that communication connections were closed after a GC pause,
> leaving you with half-open connections. It is recommended to keep
> socketWriteTimeout and the failure detection timeout in relative sync.
>
>
>
> The default socketWriteTimeout on TcpCommunicationSpi is very low (your log
> shows writeTimeout=2000) while your failure detection timeout is rather
> high (480000), which leads to this issue.
>
>
>
> It is also possible that client nodes can connect to a server node but not
> vice versa, so connections fail to reopen once they are closed:
>
>
>
> Thread [name="sys-stripe-12-#13%EDIFCustomerCC%", id=45, state=RUNNABLE, blockCnt=851, waitCnt=27526057]
>         at sun.nio.ch.Net.poll(Native Method)
>         at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
>         at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
>         at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3299)
>         at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
>         at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
>         at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
>         at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
>
>
>
> Regards,
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> On Fri, 10 Jul 2020 at 16:32, Kamlesh Joshi <Kamlesh.Joshi@ril.com> wrote:
>
> Hi Ilya,
>
>
>
> Please find attached the entire node logs, which contain a thread dump as
> well. Let us know if you have any findings.
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
> *From:* Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
> *Sent:* 10 July 2020 17:51
> *To:* user@ignite.apache.org
> *Subject:* Re: [External]Re: Ignite cluster became unresponsive
>
>
>
>
> Hello!
>
>
>
> Can you provide a full thread dump (jstack) after you see these messages?
>
>
>
> Regards,
>
> --
>
> Ilya Kasnacheev
>
>
>
>
>
> On Wed, 8 Jul 2020 at 15:57, Kamlesh Joshi <Kamlesh.Joshi@ril.com> wrote:
>
> Hi Stephen/Team,
>
>
>
> Did you get a chance to look into this?
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
> *From:* Kamlesh Joshi
> *Sent:* 06 July 2020 14:50
> *To:* user@ignite.apache.org
> *Subject:* RE: [External]Re: Ignite cluster became unresponsive
>
>
>
> Hi Stephen,
>
>
>
> We have started our node with the JVM parameters below. Also, we have
> increased the timeouts *failureDetectionTimeout*,
> *clientFailureDetectionTimeout* and *networkTimeout* to 480000.
>
>
>
> -XX:+AggressiveOpts -XX:+AlwaysPreTouch -XX:+UseG1GC
> -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC
> -XX:+UnlockCommercialFeatures -Djava.net.preferIPv4Stack=true
> -DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=600000
> -DIGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT=true -Dfile.encoding=UTF-8
> -DIGNITE_QUIET=false
>
>
>
> Is there anything else that we have to tune?
>
>
>
> And I think the JVM pause was introduced as a result of the error that we
> encountered, right? Correct me if I am wrong.
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
> *From:* Stephen Darlington <stephen.darlington@gridgain.com>
> *Sent:* 06 July 2020 14:09
> *To:* user <user@ignite.apache.org>
> *Subject:* [External]Re: Ignite cluster became unresponsive
>
>
>
>
> There are a few issues here (the blocked thread, the communication error),
> but possibly the key one is the JVM pause:
>
>
>
> [2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.
>
>
>
> This is usually due to garbage collection, but there are a number of other
> possibilities such as slow I/O. Suggest you start with the recommendations
> on the GC tuning documentation page:
> https://apacheignite.readme.io/docs/jvm-and-system-tuning
>
>
>
> Regards,
>
> Stephen
>
>
>
> On 4 Jul 2020, at 12:44, Kamlesh Joshi <Kamlesh.Joshi@ril.com> wrote:
>
>
>
> Hi Team,
>
>
>
> We have encountered the following defect in our PROD environment, after
> which all traffic was halted for around 10 minutes. We recently upgraded
> our cluster from Ignite 2.6.0 to 2.7.6.
>
> Is this related to any existing open defect in this version? Has anyone
> observed the same defect earlier?
>
>
>
> Any help or pointers around this will be appreciated.
>
>
>
>
>
> [2020-07-03T18:17:11,613][ERROR][sys-stripe-36-#37%CustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=480s]
> [2020-07-03T18:17:11,613][WARN ][sys-stripe-36-#37%CustomerCC%][G] Thread [name="exchange-worker-#344%CustomerCC%", id=391, state=TIMED_WAITING, blockCnt=1, waitCnt=2049782]
>     Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6bf9f3a4, ownerName=null, ownerId=-1]
>
> [2020-07-03T18:17:11,620][ERROR][sys-stripe-36-#37%CustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]]]
> org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]
>     at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513) [ignite-core-2.7.6.jar:2.7.6]
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
> [2020-07-03T18:17:11,625][WARN ][sys-stripe-36-#37%CustomerCC%][FailureProcessor] No deadlocked threads detected.
>
> [2020-07-03T18:17:21,790][INFO ][tcp-disco-sock-reader-#201%CustomerCC%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xx.xx.xx.xx:46416, rmtPort=46416
> [2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-31-#295%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11764, writeTimeout=2000]
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-57-#321%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:38500, writeTimeout=2000]
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-5-#269%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:41442, writeTimeout=2000]
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:44178, writeTimeout=2000]
> [2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11884, writeTimeout=2000]
> [2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:39044, writeTimeout=2000]
> [2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:48756, writeTimeout=2000]
> [2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:42190, writeTimeout=2000]
>
>
>
>
>
>
>
>
>
>
>
> *Thanks and Regards,*
>
> *Kamlesh Joshi*
>
>
>
>
> "*Confidentiality Warning*: This message and any attachments are intended
> only for the use of the intended recipient(s), are confidential and may be
> privileged. If you are not the intended recipient, you are hereby notified
> that any review, re-transmission, conversion to hard copy, copying,
> circulation or other use of this message and any attachments is strictly
> prohibited. If you are not the intended recipient, please notify the sender
> immediately by return email and delete this message and any attachments
> from your system.
>
> *Virus Warning:* Although the company has taken reasonable precautions to
> ensure no viruses are present in this email. The company cannot accept
> responsibility for any loss or damage arising from the use of this email or
> attachment."
>
