ignite-user mailing list archives

From d..@eiler.net
Subject Re: Help with tuning for larger clusters
Date Tue, 03 Nov 2015 12:48:23 GMT
Sorry for the delayed response. Thanks for opening the JIRA bug. I had
also noticed there is another one being actively worked on about slow
rebalancing.

1) Yep, before dropping the port range it took several minutes before  
everyone joined the topology. Remember I can't use multicast so I have  
a single IP configured that everyone has to talk to for discovery.

1a) The underlying network is FDR infiniband. All throughput and  
latency numbers are as expected with both IB based benchmarks. I've  
also run sockperf between nodes to get socket/IP performance and it  
was as expected (it takes a pretty big hit in both throughput and  
latency, but that is normal with the IP stack.) I don't have the  
numbers handy, but I believe sockperf showed about 2.2 GBytes/s  
throughput for any single point-to-point connection.
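For comparison with the "10GB or 1GB" question in the quoted message, the sockperf figure can be converted from bytes to bits. This is only a unit-conversion sketch: the 2.2 GBytes/s value is the approximate number recalled above, and the helper name is made up for illustration.

```python
def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert gigabytes per second to gigabits per second (8 bits per byte)."""
    return gbytes_per_s * 8


if __name__ == "__main__":
    measured = 2.2  # GBytes/s, the approximate sockperf figure recalled above
    print(f"{measured} GBytes/s is about {gbytes_to_gbits(measured):.1f} Gbit/s")
```

That works out to roughly 17.6 Gbit/s for a single point-to-point connection.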

1b) The cluster has a shared login node and the filesystem is shared,  
otherwise the individual nodes that I am launching ignite.sh on are  
exclusively mine, their own physical entities, and not being used for  
anything else. I'm not taking all the cluster nodes, so there are
other people running on other nodes accessing both the IB network and
the shared filesystem (but not my ignite installation directory, so not
the same files).

2) lol, yeah, that is what I was trying to do when I started the  
thread. I'll go back and start that process again.

3) Every now and then I have an ignite process that doesn't shut down
with my pssh kill command and requires a kill -9. I try to check every
node to make sure all the java processes have terminated (pssh ps -eaf
| grep java) but I could have missed one. I'll try to keep an eye out
for those messages as well. I've also had issues where I've stopped
and restarted the nodes too quickly and the port isn't released yet.
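One way to catch a leftover process or a not-yet-released port before restarting is to test whether the port can be bound at all. The following is a minimal sketch, not part of Ignite: the function name is hypothetical, and 10500 is simply the port from the IGFS warning in the quoted message. It only reports whether a plain bind succeeds on the node where it runs, so it would have to be launched on each node (for example through pssh).

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # No SO_REUSEADDR here on purpose: we want the bind to fail
            # if anything is still holding the port.
            sock.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    port = 10500  # port named in the IGFS endpoint warning
    state = "free" if port_is_free(port) else "still in use"
    print(f"port {port} is {state}")
```

If the check reports the port as still in use after the kill command, either a java process survived or the OS has not released the port yet, and waiting or a kill -9 is needed before restarting.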

4) Over the weekend I had a successful 64 node run, and when it came
up I didn't see any "Retry partition exchange" messages. I let it sit
for a couple of hours and everything stayed up and happy. I then started
running pi estimator with an increasing number of mappers. I think it
was when I was doing 10000 mappers that it got about 71% through and then
stopped making progress, although I kept seeing the ignite messages for
inter-node communication. When I noticed it was "stuck", there was
an NIO exception in the logs. I haven't looked at the logs in detail
yet, but the topology seemed intact and everything was up and running
for well over 12 hours.

I might need to put this on the back burner for a little bit, we'll see.


Quoting Denis Magda <dmagda@gridgain.com>:

> Joe,
> Thanks for the clarifications. Now we're on the same page.
> It's great that the cluster is initially assembled without any issue and you
> see that all 64 joined the topology.
> In regards to 'rebalancing timeout' warnings I have the following thoughts.
> First, I've opened a bug that describes your case and similar ones that happen
> on big clusters with rebalancing. You may want to track it:
> https://issues.apache.org/jira/browse/IGNITE-1837
> Second, I'm not sure that this bug is 100% your case, so there is no guarantee
> that the issue on your side disappears when it gets fixed. That's why let's
> check the following.
> 1) As far as I remember, before we decreased the port range used by discovery
> it took significant time for you to form the cluster of 64 nodes. What are
> the settings of your network (throughput, 10GB or 1GB)? How do you use these
> servers? Are they already under load from some other apps that decrease
> network throughput? I think you should find out whether everything is OK in
> this area or not. IMHO at least the situation is not ideal.
> 2) Please increase TcpCommunicationSpi.socketWriteTimeout to 15 secs (the
> same value that failureDetectionTimeout has).
> Actually you may want to try configuring network related parameters directly
> instead of relying on failureDetectionTimeout:
> - TcpCommunicationSpi.socketWriteTimeout
> - TcpCommunicationSpi.connectTimeout
> - TcpDiscoverySpi.socketTimeout
> - TcpDiscoverySpi.ackTimeout
> 3) In some logs I see that IGFS endpoint failed to start. Please check who
> occupies that port number.
> [07:33:41,736][WARN ][main][IgfsServerManager] Failed to start IGFS endpoint
> (will retry every 3s). Failed to bind to port (is port already in use?):
> 10500
> 4) Please turn off IGFS/HDFS/Hadoop completely and start the cluster. Let's
> check how long it will live in the idle state. But please take into account
> 1) before.
> Regards,
> Denis
> --
> View this message in context:  
> http://apache-ignite-users.70518.x6.nabble.com/Help-with-tuning-for-larger-clusters-tp1692p1814.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
