ignite-user mailing list archives

From Alena Laas <alena.l...@cbsinteractive.com>
Subject Re: Ignite in Kubernetes does not work correctly
Date Mon, 14 Jan 2019 09:34:08 GMT
It seems that increasing joinTimeout and failureDetectionTimeout solved the
problem.
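
For reference, a minimal sketch of where those two settings live in the Spring XML config (the 60-second values below are illustrative, not the ones actually used; `failureDetectionTimeout` is a property of `IgniteConfiguration`, `joinTimeout` belongs to `TcpDiscoverySpi`):

```xml
<!-- Illustrative timeout values only; tune for your environment. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Give slower Kubernetes networking more time before a node
         is considered failed. -->
    <property name="failureDetectionTimeout" value="60000"/>
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- Keep retrying the join instead of giving up quickly. -->
            <property name="joinTimeout" value="60000"/>
        </bean>
    </bean>
</bean>
```

Larger values give a restarted pod more time to complete discovery before the rest of the cluster drops it.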

On Fri, Jan 11, 2019 at 5:24 PM Alena Laas <alena.laas@cbsinteractive.com>
wrote:

> I attached the part of the log with the "node failed" events (100.99.129.141
> is the IP of the restarted node).
>
> These events repeat until, after roughly 40 minutes to an hour, the node
> suddenly joins the cluster.
>
> Could you explain why this is happening?
>
> On Thu, Jan 10, 2019 at 7:54 PM Alena Laas <alena.laas@cbsinteractive.com>
> wrote:
>
>> We are using Azure AKS cluster.
>>
>> We kill the pod using the Kubernetes dashboard or through kubectl (kubectl
>> delete pods <name>); either way, the result is the same.
>>
>> Maybe you need some more logs from us?
>>
>> On Thu, Jan 10, 2019 at 7:28 PM Stephen Darlington <
>> stephen.darlington@gridgain.com> wrote:
>>
>>> What kind of environment are you using? A public cloud? Your own data
>>> centre? And how are you killing the pod?
>>>
>>> I fired up a cluster using Minikube and your configuration and it worked
>>> as far as I could see. (I deleted the pod using the dashboard, for what
>>> that’s worth.)
>>>
>>> Regards,
>>> Stephen
>>>
>>> On 10 Jan 2019, at 14:20, Alena Laas <alena.laas@cbsinteractive.com>
>>> wrote:
>>>
>>>
>>>
>>> ---------- Forwarded message ---------
>>> From: Alena Laas <alena.laas@cbsinteractive.com>
>>> Date: Thu, Jan 10, 2019 at 5:13 PM
>>> Subject: Ignite in Kubernetes does not work correctly
>>> To: <user@ignite.apache.org>
>>> Cc: Vadim Shcherbakov <vadim.shcherbakov@cbsinteractive.com>
>>>
>>>
>>> Hello!
>>> Could you please help with a problem with Ignite in a Kubernetes
>>> cluster?
>>>
>>> When we start 2 Ignite nodes at the same time, or scale the Deployment
>>> from 1 to 2, everything is fine: both nodes are visible in the Ignite
>>> cluster (we use the web console to check).
>>>
>>> But after we kill the pod with one node and it restarts, the node is no
>>> longer visible in the Ignite cluster. Moreover, the logs from the
>>> restarted node are sparse:
>>> [13:32:57]    __________  ________________
>>> [13:32:57]   /  _/ ___/ |/ /  _/_  __/ __/
>>> [13:32:57]  _/ // (7 7    // /  / / / _/
>>> [13:32:57] /___/\___/_/|_/___/ /_/ /___/
>>> [13:32:57]
>>> [13:32:57] ver. 2.7.0#20181130-sha1:256ae401
>>> [13:32:57] 2018 Copyright(C) Apache Software Foundation
>>> [13:32:57]
>>> [13:32:57] Ignite documentation: http://ignite.apache.org
>>> [13:32:57]
>>> [13:32:57] Quiet mode.
>>> [13:32:57] ^-- Logging to file
>>> '/opt/ignite/apache-ignite/work/log/ignite-7d323675.0.log'
>>> [13:32:57] ^-- Logging by 'JavaLogger [quiet=true, config=null]'
>>> [13:32:57] ^-- To see **FULL** console log here add -DIGNITE_QUIET=false
>>> or "-v" to ignite.{sh|bat}
>>> [13:32:57]
>>> [13:32:57] OS: Linux 4.15.0-1036-azure amd64
>>> [13:32:57] VM information: OpenJDK Runtime Environment 1.8.0_181-b13
>>> Oracle Corporation OpenJDK 64-Bit Server VM 25.181-b13
>>> [13:32:57] Please set system property '-Djava.net.preferIPv4Stack=true'
>>> to avoid possible problems in mixed environments.
>>> [13:32:57] Configured plugins:
>>> [13:32:57] ^-- None
>>> [13:32:57]
>>> [13:32:57] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler
>>> [tryStop=false, timeout=0, super=AbstractFailureHandler
>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]]]
>>> [13:32:58] Message queue limit is set to 0 which may lead to potential
>>> OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due
>>> to message queues growth on sender and receiver sides.
>>> [13:32:58] Security status [authentication=off, tls/ssl=off]
>>>
>>> And the logs from the remaining node report the server count flapping
>>> between 2 and 1:
>>> [14:02:05] Joining node doesn't have encryption data
>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>> [14:02:15] Topology snapshot [ver=234, locNode=a5eb30e1, servers=2,
>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>> [14:02:15] Topology snapshot [ver=235, locNode=a5eb30e1, servers=1,
>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>> [14:02:20] Joining node doesn't have encryption data
>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>> [14:02:30] Topology snapshot [ver=236, locNode=a5eb30e1, servers=2,
>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>> [14:02:30] Topology snapshot [ver=237, locNode=a5eb30e1, servers=1,
>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>> [14:02:35] Joining node doesn't have encryption data
>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>> [14:02:45] Topology snapshot [ver=238, locNode=a5eb30e1, servers=2,
>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>> [14:02:45] Topology snapshot [ver=239, locNode=a5eb30e1, servers=1,
>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>> [14:02:50] Joining node doesn't have encryption data
>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>> [14:03:00] Topology snapshot [ver=240, locNode=a5eb30e1, servers=2,
>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>> [14:03:00] Topology snapshot [ver=241, locNode=a5eb30e1, servers=1,
>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>> [14:03:06] Joining node doesn't have encryption data
>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>> [14:03:16] Topology snapshot [ver=242, locNode=a5eb30e1, servers=2,
>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>> [14:03:16] Topology snapshot [ver=243, locNode=a5eb30e1, servers=1,
>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>> [14:03:21] Joining node doesn't have encryption data
>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>> [14:03:31] Topology snapshot [ver=244, locNode=a5eb30e1, servers=2,
>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>> [14:03:31] Topology snapshot [ver=245, locNode=a5eb30e1, servers=1,
>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>> [14:03:36] Joining node doesn't have encryption data
>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>> [14:03:46] Topology snapshot [ver=246, locNode=a5eb30e1, servers=2,
>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>> [14:03:46] Topology snapshot [ver=247, locNode=a5eb30e1, servers=1,
>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>> [14:03:51] Joining node doesn't have encryption data
>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>> [14:04:01] Topology snapshot [ver=248, locNode=a5eb30e1, servers=2,
>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>> [14:04:01] Topology snapshot [ver=249, locNode=a5eb30e1, servers=1,
>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>> [14:04:06] Joining node doesn't have encryption data
>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>
>>> I am attaching our Ignite server config file and the YAML files for
>>> Kubernetes. Everything was set up according to your official
>>> documentation. The Ignite version we are using is 2.7.0.
>>> Looking forward to your answer.
>>>
>>> --
>>>
>>> *ALENA LAAS*
>>> SOFTWARE ENGINEER (JAVA)
>>> CNET Content Solutions
>>> OFFICE +7.495.967.1201 FAX +7.495.967.1203
>>> 5 Letnikovskaya str., Moscow, Russia, 115114
>>> [image: CNET Content Solutions]
>>>
>>>
>>> <ignite-config-server.xml><fcat-ignite-stage.yaml>
>>>
>>>
>>>
>>>
>>
>>
>
>
>


