ignite-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Noam Liran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-2219) ClassCastException from NodeIdMessage to AffinityTopologyVersion
Date Tue, 22 Dec 2015 15:05:46 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068213#comment-15068213
] 

Noam Liran commented on IGNITE-2219:
------------------------------------

We added debug prints in org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsFullMessage.readFrom(GridDhtPartitionsFullMessage.java)
and now have the problematic buffer;

We've attached the full message buffer at this point in hex; we also think these log lines
are relevant:
{noformat}
2015-12-22 14:21:11.800 WARN  o.a.i.s.c.tcp.TcpCommunicationSpi    tcp-comm-worker-#1%null
ctx:             actor:             - Connect timed out (consider increasing 'failureDetectionTimeout'
configuration property) [addr=/0:0:0:0:0:0:0:1%lo:47105, failureDetectionTimeout=10000]
2015-12-22 14:21:11.800 WARN  o.a.i.s.c.tcp.TcpCommunicationSpi    tcp-comm-worker-#1%null
ctx:             actor:             - Connect timed out (consider increasing 'failureDetectionTimeout'
configuration property) [addr=/127.0.0.1:47105, failureDetectionTimeout=10000]
2015-12-22 14:21:11.801 WARN  o.a.i.s.c.tcp.TcpCommunicationSpi    tcp-comm-worker-#1%null
ctx:             actor:             - Failed to connect to a remote node (make sure that destination
node is alive and operating system firewall is disabled on local and remote hosts) [addrs=[ip-10-0-1-62.eu-west-1.compute.internal/10.0.1.62:47105,
/0:0:0:0:0:0:0:1%lo:47105, /127.0.0.1:47105]]
2015-12-22 14:21:11.807 WARN  o.a.i.i.p.c.GridCachePartitionExchangeManager ignite-#23%sys-null%
ctx:             actor:             - Failed to send partitions full message [node=TcpDiscoveryNode
[id=fe615af4-4935-4edb-a947-cc431e97ea48, addrs=[0:0:0:0:0:0:0:1%lo, 10.0.1.62, 127.0.0.1],
sockAddrs=[ip-10-0-1-62.eu-west-1.compute.internal/10.0.1.62:47504, /0:0:0:0:0:0:0:1%lo:47504,
/10.0.1.62:47504, /127.0.0.1:47504], discPort=47504, order=10, intOrder=10, lastExchangeTime=1450793971531,
loc=false, ver=1.4.0#20150924-sha1:c2def5f6, isClient=false], err=class o.a.i.IgniteCheckedException:
Failed to send message (node may have left the grid or TCP connection cannot be established
due to firewall issues) [node=TcpDiscoveryNode [id=fe615af4-4935-4edb-a947-cc431e97ea48, addrs=[0:0:0:0:0:0:0:1%lo,
10.0.1.62, 127.0.0.1], sockAddrs=[ip-10-0-1-62.eu-west-1.compute.internal/10.0.1.62:47504,
/0:0:0:0:0:0:0:1%lo:47504, /10.0.1.62:47504, /127.0.0.1:47504], discPort=47504, order=10,
intOrder=10, lastExchangeTime=1450793971531, loc=false, ver=1.4.0#20150924-sha1:c2def5f6,
isClient=false], topic=TOPIC_CACHE, msg=GridDhtPartitionsFullMessage [parts={218603958=GridDhtPartitionFullMap
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=451, size=18], 816937872=GridDhtPartitionFullMap
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=446, size=18], -2100569601=GridDhtPartitionFullMap
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=464, size=18], 689859866=GridDhtPartitionFullMap
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=467, size=18], -423204530=GridDhtPartitionFullMap
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=462, size=18], 1325947219=GridDhtPartitionFullMap
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=464, size=18], 1059659892=GridDhtPartitionFullMap
[nodeId=d2c6b831-09ca-426b-8de7-84a05157a752, nodeOrder=1, updateSeq=459, size=18]}, topVer=AffinityTopologyVersion
[topVer=-1, minorTopVer=0], partCnt=7, super=GridDhtPartitionsAbstractMessage [exchId=null,
lastVer=null, super=GridCacheMessage [msgId=909, depInfo=null, err=null, skipPrepare=false]]],
policy=2]]
{noformat}

The *really* interesting thing we found is this: in the full buffer, the first field is the
length of partsBytes: {{521F0100}} = {{0x00011f52}} = {{73554}}.
If you skip it (+ 4 bytes for the length itself) you get to this part of the buffer: 
{{FF0002000000000000000000FFFFFFFF08000000}}
Since the first byte is 0xFF, it is being deserialized as {{NodeIdMessage}} even though {{GridDhtPartitionsFullMessage}}
mandates that at this point only {{AffinityTopologyVersion}} should be found.

The thing is that, if you look a bit before that in the buffer you can actually find {{AffinityTopologyVersion}}:
{noformat} 
... 41FC41FD41FE41FF41 6F00000000FFFFFFFFFFFFFFFF 0002000000000000000000FFFFFFFF08000000 
{noformat}

Obviously something is a miss here: either partsBytes' length is 12 bytes too long, or 12
bytes went missing somewhere.

Regards,
Noam and [~avihai.berkovitz@microsoft.com]

> ClassCastException from NodeIdMessage to AffinityTopologyVersion
> ----------------------------------------------------------------
>
>                 Key: IGNITE-2219
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2219
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: ignite-1.4
>         Environment: Ubuntu 12.04 64 bit
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> Ignite 1.4.0
>            Reporter: Avihai Berkovitz
>         Attachments: message-hex.txt
>
>
> We had a cluster up and running for a couple of days. Without doing anything new, we
got the following error in one of the nodes:
> {noformat}
> Caught unhandled exception in NIO worker thread (restart the node). 
> java.lang.ClassCastException: org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$NodeIdMessage
cannot be cast to org.apache.ignite.internal.processors.affinity.AffinityTopologyVersion
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsFullMessage.readFrom(GridDhtPartitionsFullMessage.java:176)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.direct.DirectByteBufferStream.readMessage(DirectByteBufferStream.java:963)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.direct.DirectMessageReader.readMessage(DirectMessageReader.java:252)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.managers.communication.GridIoMessage.readFrom(GridIoMessage.java:249)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridDirectParser.decode(GridDirectParser.java:79)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:104)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onMessageReceived(GridConnectionBytesVerifyFilter.java:78)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:2124)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:173)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:898)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1437)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1379)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1263)
~[ignite-core-1.4.0.jar:1.4.0]
> 	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-1.4.0.jar:1.4.0]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> It happened only once so far, but killed the communication from this node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message