Subject: Re: Help with tuning for larger clusters
From: Denis Magda <dmagda@gridgain.com>
To: user@ignite.apache.org
Date: Tue, 3 Nov 2015 17:41:51 +0300

Hi Joe,

It's nice to hear from you. Please see below.

On 11/3/2015 3:48 PM, dev@eiler.net wrote:
> Sorry for the delayed response. Thanks for opening the JIRA bug; I had also noticed there is another one being actively worked on, about rebalancing being slow.

> 1) Yep, before dropping the port range it took several minutes before everyone joined the topology. Remember I can't use multicast so I have a single IP configured that everyone has to talk to for discovery.
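For reference, that kind of setup (a single static discovery address plus a narrowed local port range) can be sketched in Java roughly as below. The 10.0.0.1:47500..47509 address, the class name and the range of 10 ports are placeholders, not your real values:

import java.util.Arrays;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class StaticDiscoveryStart {
    public static void main(String[] args) {
        // Single well-known discovery address instead of multicast (placeholder IP and ports).
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Arrays.asList("10.0.0.1:47500..47509"));

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);
        // Narrow the range of local ports the discovery SPI may try to bind to.
        discoSpi.setLocalPortRange(10);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoSpi);

        Ignition.start(cfg);
    }
}

The same settings can equally be expressed in the Spring XML configuration passed to ignite.sh.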

> 1a) The underlying network is FDR InfiniBand. All throughput and latency numbers are as expected with both IB-based benchmarks. I've also run sockperf between nodes to get socket/IP performance and it was as expected (it takes a pretty big hit in both throughput and latency, but that is normal with the IP stack). I don't have the numbers handy, but I believe sockperf showed about 2.2 GBytes/s throughput for any single point-to-point connection.

> 1b) The cluster has a shared login node and the filesystem is shared; otherwise the individual nodes that I am launching ignite.sh on are exclusively mine, their own physical entities, and not being used for anything else. I'm not taking all the cluster nodes, so there are other people running on other nodes accessing both the IB network and the shared filesystem (but not my Ignite installation directory, so not the same files).

Ivan, don't we have any known IGFS-related issues when a shared filesystem is used by the nodes?

> 2) lol, yeah, that is what I was trying to do when I started the thread. I'll go back and start that process again.

Before playing with every parameter, try increasing just that one, TcpCommunicationSpi.socketWriteTimeout. In your case it was initialized to the default value (5 seconds).
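A minimal sketch of bumping only that property (the class name is made up, and 15000 ms is just the value discussed further down in this thread):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class LongerWriteTimeoutStart {
    public static void main(String[] args) {
        // Raise only the communication socket write timeout; everything else stays at defaults.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setSocketWriteTimeout(15000); // 15 seconds, in milliseconds

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);

        Ignition.start(cfg);
    }
}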

> 3) Every now and then I have an Ignite process that doesn't shut down with my pssh kill command and requires a kill -9. I try to check every node to make sure all the Java processes have terminated (pssh ps -eaf | grep java), but I could have missed one. I'll try to keep an eye out for those messages as well. I've also had issues where I've stopped and restarted the nodes too quickly and the port isn't released yet.
I would recommend using the 'jps' tool to get a list of all running Java processes, because sometimes a process shows up under a name other than 'java':
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jps.html


> 4) Over the weekend I had a successful 64-node run, and when it came up I didn't see any "Retry partition exchange" messages. I let it sit for a couple of hours and everything stayed up and happy. I then started running the pi estimator with an increasing number of mappers. I think it was when I was doing 10000 mappers that it got about 71% through and then stopped making progress, although I kept seeing the Ignite messages for inter-node communication. When I noticed it was "stuck", there was an NIO exception in the logs. I haven't looked at the logs in detail yet, but the topology seemed intact and everything was up and running well over 12 hours.

Could you share the example's source code with us? Perhaps we will notice something strange.
In addition, the next time your nodes get stuck, please take thread dumps and heap dumps and share them with us for analysis.
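If attaching jstack to every node is awkward, a thread dump can also be produced from inside the JVM with the standard management API. A rough sketch (the helper class is hypothetical):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
    /** Prints a full thread dump of the current JVM to standard output. */
    public static void dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Ask for full stack traces plus monitor/synchronizer info for every live thread.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());

            for (StackTraceElement frame : info.getStackTrace())
                System.out.println("\tat " + frame);

            System.out.println();
        }
    }
}

For the heap dumps, jmap or the -XX:+HeapDumpOnOutOfMemoryError JVM flag are the usual routes.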

--
Denis

> I might need to put this on the back burner for a little bit, we'll see.

> Joe

> Quoting Denis Magda <dmagda@gridgain.com>:

>> Joe,
>>
>> Thanks for the clarifications. Now we're on the same page.
>>
>> It's great that the cluster initially assembles without any issues and you
>> can see that all 64 nodes joined the topology.
>>
>> Regarding the 'rebalancing timeout' warnings, I have the following thoughts.
>>
>> First, I've opened a bug that describes your case and similar ones that happen
>> on big clusters during rebalancing. You may want to track it:
>> https://issues.apache.org/jira/browse/IGNITE-1837
>>
>> Second, I'm not sure that this bug is 100% your case, and I can't guarantee
>> that the issue on your side will disappear when it gets fixed, so let's
>> check the following.
>>
>> 1) As far as I remember, before we decreased the port range used by discovery
>> it took significant time for you to form the cluster of 64 nodes. What are
>> the settings of your network (throughput, 10 Gb/s or 1 Gb/s)? How do you use
>> these servers? Are they already under load from some other apps that reduce
>> network throughput? I think you should find out whether everything is OK in
>> this area or not. IMHO, at least, the situation is not ideal.
>>
>> 2) Please increase TcpCommunicationSpi.socketWriteTimeout to 15 seconds (the
>> same value that failureDetectionTimeout has).
>> Actually, you may want to try configuring the network-related parameters
>> directly instead of relying on failureDetectionTimeout:
>> - TcpCommunicationSpi.socketWriteTimeout
>> - TcpCommunicationSpi.connectTimeout
>> - TcpDiscoverySpi.socketTimeout
>> - TcpDiscoverySpi.ackTimeout
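To make that concrete, here is a sketch of setting those four parameters programmatically. The class name is made up and the 15-second values simply mirror the failureDetectionTimeout mentioned above; treat them as a starting point:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class ExplicitNetworkTimeoutsStart {
    public static void main(String[] args) {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setSocketWriteTimeout(15000); // ms
        commSpi.setConnectTimeout(15000);     // ms

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        // Keep your static IP finder here; it is omitted only for brevity.
        discoSpi.setSocketTimeout(15000);     // ms
        discoSpi.setAckTimeout(15000);        // ms

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);
        cfg.setDiscoverySpi(discoSpi);

        // Alternative: leave the SPIs at their defaults and set only
        // cfg.setFailureDetectionTimeout(15000);
        // letting the SPIs derive their timeouts from that single value.

        Ignition.start(cfg);
    }
}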

>> 3) In some logs I see that the IGFS endpoint failed to start. Please check which
>> process occupies that port number:
>> [07:33:41,736][WARN ][main][IgfsServerManager] Failed to start IGFS endpoint
>> (will retry every 3s). Failed to bind to port (is port already in use?):
>> 10500
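As a quick check, a small throwaway probe that tries to bind the port will tell you whether something is already sitting on it (netstat or lsof on the node will show which process it actually is):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class PortProbe {
    public static void main(String[] args) {
        int port = 10500; // the IGFS endpoint port from the warning above

        try (ServerSocket probe = new ServerSocket()) {
            // bind() fails with a BindException if another process already owns the port.
            probe.bind(new InetSocketAddress(port));
            System.out.println("Port " + port + " is free.");
        }
        catch (IOException e) {
            System.out.println("Port " + port + " is already in use: " + e);
        }
    }
}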

>> 4) Please turn off IGFS/HDFS/Hadoop entirely and start the cluster. Let's
>> check how long it stays alive in the idle state. But please take point 1)
>> above into account first.

>> Regards,
>> Denis




>> --
>> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Help-with-tuning-for-larger-clusters-tp1692p1814.html
>> Sent from the Apache Ignite Users mailing list archive at Nabble.com.



