Subject: Re: Help with tuning for larger clusters
From: Dmitriy Setrakyan
Date: Wed, 28 Oct 2015 22:46:00 -0700
To: user@ignite.apache.org
Cc: dev@eiler.net

Should we add some of these performance and tuning tips to our documentation?

http://apacheignite.gridgain.org/docs/performance-tips


D.

On Wed, Oct 28, 2015 at 7:04 AM, Denis Magda <dmagda@gridgain.com> wrote:
Hi Joe,

No problems, I'll guide you until we get to the bottom.

Do you start pre-loading the caches with data right after the cluster is ready? If so, let's postpone doing this until you have a stable cluster with caches rebalanced and ready to be used.

Please, do the following as the next steps:

1) Set 'failureDetectionTimeout' to a bigger value (~15 secs);

2) Set CacheConfiguration.setRebalanceTimeout to a value approximately equal to the time it takes for all the nodes to join the topology (~1 minute or so); see the sketch after this list.

3) Enable verbose logging for every node by passing the -DIGNITE_QUIET=false parameter in the virtual machine arguments list. If you use the ignite.sh script then just pass the '-v' flag.

4) Enable garbage collection logs for every node by adding this string to the virtual machine arguments list: -Xloggc:./gc.log -XX:+PrintGCDetails -verbose:gc
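
Here is a minimal sketch of how steps 1 and 2 could look in the Spring XML, assuming the property names follow the setter names as usual (both timeout values are illustrative):

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Step 1: allow slower nodes more time before they are considered failed. -->
    <property name="failureDetectionTimeout" value="15000" />
    <property name="cacheConfiguration">
        <list>
            <bean class="org.apache.ignite.configuration.CacheConfiguration">
                <!-- Step 2: wait up to ~1 minute for rebalancing after nodes join. -->
                <property name="rebalanceTimeout" value="60000" />
            </bean>
        </list>
    </property>
</bean>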

Once you have done a test run taking all the points above into account, please gather all the logs (including garbage collection logs) and send them to us for further investigation.

Regards,
Denis


On 10/28/2015 1:40 PM, dev@eiler.net wrote:
Thanks for the info Denis.

Removing the failureDetectionTimeout and using the networkTimeout seems to allow the nodes to join the topology in about the same amount of time. I'm still only having occasional success running anything (even just the pi estimator).

I seem to always see a bunch of warnings...a summary is dumped below along with my config at the end, any guidance you can provide is appreciated.

Thanks,
Joe

Every node seems to see a bunch of "Retrying preload partition" warnings, with the lowest locNodeOrder having fewer nodes in the remaining list:

[14:52:38,979][WARN ][ignite-#104%sys-null%][GridDhtPartitionsExchangeFuture] Retrying preload partition exchange due to timeout [done=false, dummy=false, exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=62, minorTopVer=0], nodeId=fd9620f5, evt=NODE_JOINED], rcvdIds=[], rmtIds=[0ab29a08, 5216f6ba, f882885f, 0d232f1a, b74f5ebb, 5790761a, 55d2082e, b1bf93b3, 2fd79f9f, a899ccce, 3dd74aba, 320d05fd, 0d44a4b3, 9a00f235, 4426467e, 7837fdfc, e8778da0, 4a988e3e, f8cabdbb, 494ad6fd, 7c05abfb, 5902c851, c406028e, a0b57685, e213b903, c85a0b46, df981c08, 187cd54f, f0b7b298, 94ec7576, 041975f5, aecba5d0, 5549256d, f9b5a77a, 596d0df7, 26266d8c, 0e664e25, 97d112b2, aac08043, 6b81a2b1, 5a2a1012, 534ac94b, b34cb942, 837785eb, 966d70b2, 3aab732e, 4e34ad89, 6df0ffff, 4c7c3c47, 85eea5fe, 1c5e2f6b, 3f426f4e, 27a9bef9, cd874e96, dc3256a7, 4da50521, 1d370c9e, 19c334eb, 24be15dd, 6c922af3, 01ea2812], remaining=[0ab29a08, 5216f6ba, f882885f, 0d232f1a, b74f5ebb, 5790761a, 55d2082e, b1bf93b3, 2fd79f9f, a899ccce, 3dd74aba, 320d05fd, 0d44a4b3, 9a00f235, 4426467e, 7837fdfc, e8778da0, 4a988e3e, f8cabdbb, 494ad6fd, 7c05abfb, 5902c851, c406028e, a0b57685, e213b903, c85a0b46, df981c08, 187cd54f, f0b7b298, 94ec7576, 041975f5, aecba5d0, 5549256d, f9b5a77a, 596d0df7, 26266d8c, 0e664e25, 97d112b2, aac08043, 6b81a2b1, 5a2a1012, 534ac94b, b34cb942, 837785eb, 966d70b2, 3aab732e, 4e34ad89, 6df0ffff, 4c7c3c47, 85eea5fe, 1c5e2f6b, 3f426f4e, 27a9bef9, cd874e96, dc3256a7, 4da50521, 1d370c9e, 19c334eb, 24be15dd, 6c922af3, 01ea2812], init=true, initFut=true, ready=true, replied=false, added=true, oldest=0d44a4b3, oldestOrder=1, evtLatch=0, locNodeOrder=62, locNodeId=fd9620f5-3ebb-4a71-a482-73d6a81b1688]


[14:38:41,893][WARN ][ignite-#95%sys-null%][GridDhtPartitionsExchangeFuture] Retrying preload partition exchange due to timeout [done=false, dummy=false, exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=25, minorTopVer=0], nodeId=df981c08, evt=NODE_JOINED], rcvdIds=[7c05abfb, b34cb942, e213b903, 320d05fd, 5902c851, f0b7b298, 1d370c9e, 0d232f1a, 494ad6fd, 5a2a1012, b1bf93b3, 55d2082e, 7837fdfc, 85eea5fe, 4e34ad89, 5790761a, 3f426f4e, aac08043, 187cd54f, 01ea2812, c406028e, 24be15dd, 966d70b2], rmtIds=[0d232f1a, 5790761a, 55d2082e, b1bf93b3, aac08043, 5a2a1012, b34cb942, 320d05fd, 966d70b2, 4e34ad89, 85eea5fe, 7837fdfc, 3f426f4e, 1d370c9e, 494ad6fd, 7c05abfb, 5902c851, c406028e, 24be15dd, e213b903, df981c08, 187cd54f, f0b7b298, 01ea2812], remaining=[df981c08], init=true, initFut=true, ready=true, replied=false, added=true, oldest=0d44a4b3, oldestOrder=1, evtLatch=0, locNodeOrder=1, locNodeId=0d44a4b3-4d10-4f67-b8bd-005be226b1df]


I also see a little over half the nodes getting "Still waiting for initial partition map exchange" warnings like this:


[14:39:37,848][WARN ][main][GridCachePartitionExchangeManager] Still waiting for initial partition map exchange [fut=GridDhtPartitionsExchangeFuture [dummy=false, forcePreload=false, reassign=false, discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=27a9bef9-de04-486d-aac0-bfa749e9007d, addrs=[0:0:0:0:0:0:0:1%1, 10.148.0.87, 10.159.1.182, 127.0.0.1], sockAddrs=[r1i4n10.redacted.com/10.148.0.87:47500, /0:0:0:0:0:0:0:1%1:47500, /10.159.1.182:47500, /10.148.0.87:47500, /10.159.1.182:47500, /127.0.0.1:47500], discPort=47500, order=48, intOrder=48, lastExchangeTime=1445974777828, loc=true, ver=1.4.0#19691231-sha1:00000000, isClient=false], topVer=48, nodeId8=27a9bef9, msg=null, type=NODE_JOINED, tstamp=1445974647187], rcvdIds=GridConcurrentHashSet [elements=[]], rmtIds=[0ab29a08-9c95-4054-8035-225f5828b3d4, 0d232f1a-0f46-4798-a39a-63a17dc4dc7f, f9b5a77a-a4c1-46aa-872e-aeaca9b76ee3, 596d0df7-3edf-4078-8f4a-ffa3d96296c6, 5790761a-aeeb-44d1-9fce-3fee31ef39b7, 55d2082e-3517-4828-8d47-57b4ed5a41bc, 26266d8c-cc87-4472-9fa4-c526d6da2233, 0e664e25-8dde-4df8-966b-53b60f9a1087, b1bf93b3-24bb-4520-ade0-31d05a93558d, aac08043-875a-485a-ab2c-cd7e66d68f8f, 2fd79f9f-9590-41d2-962e-004a3d7690b5, 5a2a1012-0766-448c-9583-25873c305de9, 534ac94b-8dd1-4fa8-a481-539fa4f4ce55, b34cb942-e960-4a00-b4fb-10add6466a93, 320d05fd-e021-40ac-83bc-62f54756771b, 0d44a4b3-4d10-4f67-b8bd-005be226b1df, 837785eb-24e0-496a-a0cc-f795b64b5929, 9a00f235-3b6a-4be5-b0e3-93cd1beacaf4, 966d70b2-e1dc-4e20-9876-b63736545abd, 3aab732e-a075-4b19-9525-e97a1260a4fe, 4e34ad89-fa46-4503-a599-b8c937ca1f47, 4c7c3c47-6e5c-4c15-80a9-408192596bc2, 85eea5fe-9aff-4821-970c-4ce006ee853a, 7837fdfc-6255-4784-8088-09d4e6e37bb9, 3f426f4e-2d0c-402a-a4af-9d7656f46484, e8778da0-a764-4ad9-afba-8a748564e12a, 4a988e3e-3434-4271-acd6-af2a1e30524c, cd874e96-63cf-41c9-8e8a-75f3223bfe9d, f8cabdbb-875a-480b-8b5e-4b5313c5fcbd, dc3256a7-ae23-4c2e-b375-55e2884e045d, 4da50521-aad0-48a4-9f79-858bbc2e6b89, 1d370c9e-250f-4733-8b8a-7b6f5c6e1b2b, 494ad6fd-1637-44b8-8d3a-1fa19681ba64, 7c05abfb-dba1-43c3-a8b1-af504762ec60, 5902c851-5275-41fd-89c4-cd6390c88670, 19c334eb-5661-4697-879d-1082571dfef8, c406028e-768e-404e-8417-40d2960c4ba3, a0b57685-e5dc-498c-99a4-33b1aef32632, 24be15dd-45f7-4980-b4f8-3176ab67e8f6, e213b903-107b-4465-8fe1-78b7b393d631, df981c08-148d-4266-9ea7-163168012968, 187cd54f-396b-4c3c-9bfc-9883ac37f556, f0b7b298-6432-477a-85a0-83e29e8c5380, 94ec7576-7a02-4c08-8739-4e0fc52a3d3a, 041975f5-990a-4792-b384-eded32966783, 01ea2812-5005-4152-af2e-2586bf65b4c6, aecba5d0-9d9b-4ab6-9018-62f5abb7b809], exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=48, minorTopVer=0], nodeId=27a9bef9, evt=NODE_JOINED], init=true, ready=true, replied=false, added=true, initFut=GridFutureAdapter [resFlag=2, res=true, startTime=1445974657836, endTime=1445974658400, ignoreInterrupts=false, lsnr=null, state=DONE], topSnapshot=null, lastVer=null, partReleaseFut=GridCompoundFuture [lsnrCalls=3, finished=true, rdc=null, init=true, res=java.util.concurrent.atomic.AtomicMarkableReference@6b58be0e, err=null, done=true, cancelled=false, err=null, futs=[true, true, true]], skipPreload=false, clientOnlyExchange=false, oldest=0d44a4b3-4d10-4f67-b8bd-005be226b1df, oldestOrder=1, evtLatch=0, remaining=[0ab29a08-9c95-4054-8035-225f5828b3d4, 0d232f1a-0f46-4798-a39a-63a17dc4dc7f, f9b5a77a-a4c1-46aa-872e-aeaca9b76ee3, 596d0df7-3edf-4078-8f4a-ffa3d96296c6, 5790761a-aeeb-44d1-9fce-3fee31ef39b7, 55d2082e-3517-4828-8d47-57b4ed5a41bc, 26266d8c-cc87-4472-9fa4-c526d6da2233, 0e664e25-8dde-4df8-966b-53b60f9a1087, b1bf93b3-24bb-4520-ade0-31d05a93558d, aac08043-875a-485a-ab2c-cd7e66d68f8f, 2fd79f9f-9590-41d2-962e-004a3d7690b5, 5a2a1012-0766-448c-9583-25873c305de9, 534ac94b-8dd1-4fa8-a481-539fa4f4ce55, b34cb942-e960-4a00-b4fb-10add6466a93, 320d05fd-e021-40ac-83bc-62f54756771b, 0d44a4b3-4d10-4f67-b8bd-005be226b1df, 837785eb-24e0-496a-a0cc-f795b64b5929, 9a00f235-3b6a-4be5-b0e3-93cd1beacaf4, 966d70b2-e1dc-4e20-9876-b63736545abd, 3aab732e-a075-4b19-9525-e97a1260a4fe, 4e34ad89-fa46-4503-a599-b8c937ca1f47, 4c7c3c47-6e5c-4c15-80a9-408192596bc2, 85eea5fe-9aff-4821-970c-4ce006ee853a, 7837fdfc-6255-4784-8088-09d4e6e37bb9, 3f426f4e-2d0c-402a-a4af-9d7656f46484, e8778da0-a764-4ad9-afba-8a748564e12a, 4a988e3e-3434-4271-acd6-af2a1e30524c, cd874e96-63cf-41c9-8e8a-75f3223bfe9d, f8cabdbb-875a-480b-8b5e-4b5313c5fcbd, dc3256a7-ae23-4c2e-b375-55e2884e045d, 4da50521-aad0-48a4-9f79-858bbc2e6b89, 1d370c9e-250f-4733-8b8a-7b6f5c6e1b2b, 494ad6fd-1637-44b8-8d3a-1fa19681ba64, 7c05abfb-dba1-43c3-a8b1-af504762ec60, 5902c851-5275-41fd-89c4-cd6390c88670, 19c334eb-5661-4697-879d-1082571dfef8, c406028e-768e-404e-8417-40d2960c4ba3, a0b57685-e5dc-498c-99a4-33b1aef32632, 24be15dd-45f7-4980-b4f8-3176ab67e8f6, e213b903-107b-4465-8fe1-78b7b393d631, df981c08-148d-4266-9ea7-163168012968, 187cd54f-396b-4c3c-9bfc-9883ac37f556, f0b7b298-6432-477a-85a0-83e29e8c5380, 94ec7576-7a02-4c08-8739-4e0fc52a3d3a, 041975f5-990a-4792-b384-eded32966783, 01ea2812-5005-4152-af2e-2586bf65b4c6, aecba5d0-9d9b-4ab6-9018-62f5abb7b809], super=GridFutureAdapter [resFlag=0, res=null, startTime=1445974657836, endTime=0, ignoreInterrupts=false, lsnr=null, state=INIT]]]



Then, on the occasions when mapreduce jobs fail, I will see the following on one node (it isn't always the same node):


[14:52:57,080][WARN ][grid-nio-worker-3-#135%null%][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Connection timed out]
[14:52:59,123][WARN ][grid-nio-worker-3-#135%null%][TcpCommunicationSpi] Failed to process selector key (will close): GridSelectorNioSessionImpl [selectorIdx=3, queueSize=0, writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], recovery=GridNioRecoveryDescriptor [acked=3, resendCnt=0, rcvCnt=0, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=837785eb-24e0-496a-a0cc-f795b64b5929, addrs=[0:0:0:0:0:0:0:1%1, 10.148.0.81, 10.159.1.176, 127.0.0.1], sockAddrs=[/10.159.1.176:47500, /0:0:0:0:0:0:0:1%1:47500, r1i4n4.redacted.com/10.148.0.81:47500, /10.148.0.81:47500, /10.159.1.176:47500, /127.0.0.1:47500], discPort=47500, order=45, intOrder=45, lastExchangeTime=1445974625750, loc=false, ver=1.4.0#19691231-sha1:00000000, isClient=false], connected=true, connectCnt=1, queueLimit=5120], super=GridNioSessionImpl [locAddr=/10.159.1.112:46222, rmtAddr=/10.159.1.176:47100, createTime=1445974646591, closeTime=0, bytesSent=30217, bytesRcvd=9, sndSchedTime=1445975577912, lastSndTime=1445975577912, lastRcvTime=1445974655114, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@44de55ba, directMode=true], GridConnectionBytesVerifyFilter], accepted=false]]
[14:52:59,124][WARN ][grid-nio-worker-3-#135%null%][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Connection timed out]
[14:53:00,105][WARN ][grid-nio-worker-3-#135%null%][TcpCommunicationSpi] Failed to process selector key (will close): GridSelectorNioSessionImpl [selectorIdx=3, queueSize=0, writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], recovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=0, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=4426467e-b4b4-4912-baa1-d7cc839d9188, addrs=[0:0:0:0:0:0:0:1%1, 10.148.0.106, 10.159.1.201, 127.0.0.1], sockAddrs=[/10.159.1.201:47500, /0:0:0:0:0:0:0:1%1:47500, r1i5n11.redacted.com/10.148.0.106:47500, /10.148.0.106:47500, /10.159.1.201:47500, /127.0.0.1:47500], discPort=47500, order=57, intOrder=57, lastExchangeTime=1445974625790, loc=false, ver=1.4.0#19691231-sha1:00000000, isClient=false], connected=true, connectCnt=1, queueLimit=5120], super=GridNioSessionImpl [locAddr=/10.159.1.112:60869, rmtAddr=/10.159.1.201:47100, createTime=1445974654478, closeTime=0, bytesSent=22979, bytesRcvd=0, sndSchedTime=1445975577912, lastSndTime=1445975577912, lastRcvTime=1445974654478, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@44de55ba, directMode=true], GridConnectionBytesVerifyFilter], accepted=false]]
[14:53:00,105][WARN ][grid-nio-worker-3-#135%null%][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Connection timed out]


I've tried adjusting the timeout settings further but haven't had much success.

Here is what my config looks like; it is obviously heavily based on the hadoop example config.


<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:ns1="http://www.w3.org/2001/XMLSchema-instance"
       ns1:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/util
           http://www.springframework.org/schema/util/spring-util.xsd">
  <description>
    Spring file for Ignite node configuration with IGFS and Apache Hadoop
    map-reduce support enabled. Ignite node will start with this
    configuration by default.
  </description>
  <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer" id="propertyConfigurer">
    <property name="systemPropertiesModeName" value="SYSTEM_PROPERTIES_MODE_FALLBACK" />
    <property name="searchSystemEnvironment" value="true" />
  </bean>
  <!-- Base IGFS settings shared by the file system instance below. -->
  <bean abstract="true" class="org.apache.ignite.configuration.FileSystemConfiguration" id="igfsCfgBase">
    <property name="blockSize" value="#{128 * 1024}" />
    <property name="perNodeBatchSize" value="512" />
    <property name="perNodeParallelBatchCount" value="16" />
    <property name="prefetchBlocks" value="32" />
  </bean>
  <!-- Base settings for the IGFS data cache. -->
  <bean abstract="true" class="org.apache.ignite.configuration.CacheConfiguration" id="dataCacheCfgBase">
    <property name="cacheMode" value="PARTITIONED" />
    <property name="atomicityMode" value="TRANSACTIONAL" />
    <property name="writeSynchronizationMode" value="FULL_SYNC" />
    <property name="backups" value="0" />
    <property name="affinityMapper">
      <bean class="org.apache.ignite.igfs.IgfsGroupDataBlocksKeyMapper">
        <constructor-arg value="512" />
      </bean>
    </property>
    <property name="startSize" value="#{100*1024*1024}" />
    <property name="offHeapMaxMemory" value="0" />
  </bean>
  <!-- Base settings for the IGFS metadata cache. -->
  <bean abstract="true" class="org.apache.ignite.configuration.CacheConfiguration" id="metaCacheCfgBase">
    <property name="cacheMode" value="REPLICATED" />
    <property name="atomicityMode" value="TRANSACTIONAL" />
    <property name="writeSynchronizationMode" value="FULL_SYNC" />
  </bean>
  <bean class="org.apache.ignite.configuration.IgniteConfiguration" id="grid.cfg">
    <property name="failureDetectionTimeout" value="3000" />
    <property name="hadoopConfiguration">
      <bean class="org.apache.ignite.configuration.HadoopConfiguration">
        <property name="finishedJobInfoTtl" value="30000" />
      </bean>
    </property>
    <property name="connectorConfiguration">
      <bean class="org.apache.ignite.configuration.ConnectorConfiguration">
        <property name="port" value="11211" />
      </bean>
    </property>
    <property name="fileSystemConfiguration">
      <list>
        <bean class="org.apache.ignite.configuration.FileSystemConfiguration" parent="igfsCfgBase">
          <property name="name" value="igfs" />
          <property name="metaCacheName" value="igfs-meta" />
          <property name="dataCacheName" value="igfs-data" />
          <property name="ipcEndpointConfiguration">
            <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
              <property name="type" value="TCP" />
              <property name="host" value="r1i0n12" />
              <property name="port" value="10500" />
            </bean>
          </property>
        </bean>
      </list>
    </property>
    <property name="cacheConfiguration">
      <list>
        <bean class="org.apache.ignite.configuration.CacheConfiguration" parent="metaCacheCfgBase">
          <property name="name" value="igfs-meta" />
        </bean>
        <bean class="org.apache.ignite.configuration.CacheConfiguration" parent="dataCacheCfgBase">
          <property name="name" value="igfs-data" />
        </bean>
      </list>
    </property>
    <property name="includeEventTypes">
      <list>
        <ns2:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FAILED" xmlns:ns2="http://www.springframework.org/schema/util" />
        <ns2:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FINISHED" xmlns:ns2="http://www.springframework.org/schema/util" />
        <ns2:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_MAPPED" xmlns:ns2="http://www.springframework.org/schema/util" />
      </list>
    </property>
    <!-- Static IP discovery: a single seed node on a single port. -->
    <property name="discoverySpi">
      <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="ipFinder">
          <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
            <property name="addresses">
              <list>
                <value>r1i0n12:47500</value>
              </list>
            </property>
          </bean>
        </property>
      </bean>
    </property>
  </bean>
</beans>



Quoting Denis Magda <dmagda@gridgain.com>:

Hi Joe,

Great!

Please see below

On 10/27/2015 9:37 AM, dev@eiler.net wrote:
Reducing the port range (to a single port) and lowering the IgniteConfiguration.setFailureDetectionTimeout to 1000 helped speed up everybody joining the topology and I was able to get a pi estimator run on 64 nodes.


I suspect that the reason was in the number of ports specified in the range. For some reason it takes significant time to get a response from the TCP/IP stack that a connection can't be established on a particular port number.
Please try to reduce the port range, lower TcpDiscoverySpi.setNetworkTimeout, keep IgniteConfiguration.setFailureDetectionTimeout's default value, and share the results with us.
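
A sketch of what a tightened discovery section could look like, assuming the property names follow the setter names (the timeout value is illustrative):

<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <!-- Scan a single local port instead of a range. -->
        <property name="localPortRange" value="1" />
        <!-- Lower the network timeout used while joining (default is 5000 ms). -->
        <property name="networkTimeout" value="3000" />
        <property name="ipFinder">
            <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                <property name="addresses">
                    <list>
                        <value>r1i0n12:47500</value>
                    </list>
                </property>
            </bean>
        </property>
    </bean>
</property>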

Thanks again for the help, I'm over the current hurdle.
Joe


Quoting dev@eiler.net:

Thanks for the quick response Denis.

I did a port range of 10 ports. I'll take a look at the failureDetectionTimeout and networkTimeout.

Side question: Is there an easy way to map between the programmatic API and the spring XML properties? For instance I was trying to find the correct xml incantation for TcpDiscoverySpi.setMaxMissedHeartbeats(int) and I might have a similar issue finding IgniteConfiguration.setFailureDetectionTimeout(long). It seems like I can usually drop the set and adjust capitalization (setFooBar() == <property name="fooBar">).

Yes, your understanding is correct.
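For example, TcpDiscoverySpi.setMaxMissedHeartbeats(int) should map to the following (a sketch; the value is illustrative):

<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
    <!-- setMaxMissedHeartbeats(int) becomes the property name "maxMissedHeartbeats". -->
    <property name="maxMissedHeartbeats" value="3" />
</bean>
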
Please pardon my ignorance on terminology:
Are the nodes I run ignite.sh on considered server nodes or cluster nodes? (I would have thought they are the same.)

Actually we have a notion of server and client nodes. This page contains extensive information on the type of nodes:
https://apacheignite.readme.io/docs/clients-vs-servers

A cluster node is just a server or client node.

Regards,
Denis
Thanks,
Joe

Quoting Denis Magda <dmagda@gridgain.com>:

Hi Joe,

How big is the port range that you specified in your discovery configuration for every single node?
Please take into account that the discovery may iterate over every port in the range before one node connects to the other, and depending on the TCP-related settings of your network it may take significant time before the cluster is assembled.

Here I would recommend you to reduce the port range as much as possible and to play with the following network related parameters:
- Try to use the failure detection timeout instead of setting socket, ack and many other timeouts explicitly (https://apacheignite.readme.io/docs/cluster-config#failure-detection-timeout);
- Try to play with TcpDiscoverySpi.networkTimeout, because this timeout is taken into account while a cluster node is trying to join the cluster.

In order to help you with the hanging compute tasks and to give you more specific recommendations regarding the slow join process please provide us with the following:
- config files for server and cluster nodes;
- log files from all the nodes. Please start the nodes with the -DIGNITE_QUIET=false virtual machine property. If you start the nodes using ignite.sh/bat then just pass '-v' as an argument to the script.
- thread dumps for the nodes that are hanging waiting for the compute tasks to be completed.

Regards,
Denis

On 10/26/2015 6:56 AM, dev@eiler.net wrote:
Hi all,

I have been experimenting with Ignite and have run into a problem scaling up to larger clusters.

I am playing with only two different use cases: 1) a Hadoop MapReduce accelerator, and 2) an in-memory data grid (no secondary file system) accessed by frameworks through the HDFS API.

Everything works fine with a smaller cluster (8 nodes), but with a larger cluster (64 nodes) it takes a couple of minutes for all the nodes to register with the cluster (which would be OK) and mapreduce jobs just hang and never return.

I've compiled the latest Ignite 1.4 (with ignite.edition=hadoop) from source, and am using it with Hadoop 2.7.1, just trying to run things like the pi estimator and wordcount examples.

I started with the config/hadoop/default-config.xml

I can't use multicast so I've configured it to use static IP based discovery with just a single node/port range.

I've increased the heartbeat frequency to 10000 and that seemed to help make things more stable once all the nodes do join the cluster (a sketch of that setting follows below). I've also played with increasing both the socket timeout and the ack timeout, but that seemed to just make it take longer for nodes to attempt to join the cluster after a failed attempt.
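
For reference, a sketch of how that heartbeat setting might be expressed in the XML, assuming the property follows the setter name TcpDiscoverySpi.setHeartbeatFrequency(long):

<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
    <!-- Heartbeat issuing interval, in milliseconds. -->
    <property name="heartbeatFrequency" value="10000" />
</bean>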

I have access to a couple of different clusters, we allocate resources with slurm so I get a piece of a cluster to play with (hence the no-multicast restriction). The nodes all have fast networks (FDR InfiniBand) and a decent amount of memory (64GB-128GB) but no local storage (or swap space).

As mentioned earlier, I disable the secondaryFilesystem.

Any advice/hints/example xml configs would be extremely welcome.


I also haven't been seeing the expected performance using the hdfs api to access ignite. I've tried both using the hdfs cli to do some simple timings of put/get and a little java program that writes then reads a file. Even with small files (500MB) that should be kept completely in a single node, I only see about 250MB/s for writes, and reads are much slower than that (4x to 10x). The writes are better than hdfs (our hdfs is backed with pretty poor storage) but reads are much slower. Now I haven't tried scaling this at all, but with an 8 node ignite cluster and a single "client" accessing a single file I would hope for something closer to memory speeds. (If you would like me to split this into another message to the list just let me know; I'm assuming the cause is the same: I missed a required config setting ;-) )

Thanks in advance for any help,
Joe











