ignite-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode
Date Wed, 12 Sep 2018 22:37:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612823#comment-16612823
] 

ASF GitHub Bot commented on IGNITE-9026:
----------------------------------------

GitHub user DaveWHarvey opened a pull request:

    https://github.com/apache/ignite/pull/4741

    IGNITE-9026 fix random class loading failures

    Skip recursive resource requests to originating nodes, rather than failing the entire request.
    Continue to search other nodes on errors, because the assumption that all nodes have the same
    view is incorrect.
    Restrict the recursive searches that a node should do when looking for resources by avoiding
    the nodes that the sender has searched or will search.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/percipiomedia/ignite p2p_two_hops

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/4741.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4741
    
----
commit 218469c6157f2aada33acb69adac60e25112a73a
Author: Dave Harvey <dharvey@...>
Date:   2018-07-18T20:51:50Z

    IGNITE-9026 fix random class loading failures
    
    Skip recursive resource requests to originating nodes, rather than failing the entire request.
    Continue to search other nodes on errors, because the assumption that all nodes have the same
    view is incorrect.
    Restrict the recursive searches that a node should do when looking for resources by avoiding
    the nodes that the sender has searched or will search.

----


> Two levels of Peer class loading fails in CONTINUOUS mode
> ---------------------------------------------------------
>
>                 Key: IGNITE-9026
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9026
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.5
>            Reporter: David Harvey
>            Assignee: David Harvey
>            Priority: Major
>
> We had a seemingly functional system in SHARED mode, where we have a custom StreamReceiver
> that sometimes sends closures from the peer-class-loaded code to other servers. However, we
> ended up running out of Metaspace, because we had more than 6000 class loaders! We suspected
> a regression in this change [https://github.com/apache/ignite/commit/d2050237ee2b760d1c9cbc906b281790fd0976b4#diff-3fae20691c16a617d0c6158b0f61df3c],
> so we switched to CONTINUOUS mode. We then started getting failures to load some of the
> classes for the closures on the second server. Through some testing and code inspection,
> we found what seem to be the following flaws in the interaction between
> GridDeploymentCommunication.sendResourceRequest() and its two callers.
> The callers iterate through all the participant nodes until they find an online node that
> responds to the request (a timeout is treated as an offline node), with either success or failure,
> and then the loop terminates. The assumption is that all nodes are equally capable of providing
> the resource, so if one fails, then the others would also fail.
> The first flaw is that GridDeploymentCommunication.sendResourceRequest() has a check
> for a cycle, i.e., whether the destination node is one of the nodes that originated or forwarded
> this request, and in that case, a failure response is faked. However, that causes the
> caller's loop to terminate. So depending on the order of the nodes in the participant list,
> sendResourceRequest() may fail before trying any nodes because it has one of the calling nodes
> on this list. It should instead skip any of the calling nodes.
> Example with 1 client node and 2 server nodes: C1 sends data to S1, which forwards a closure
> to S2. C1 also sends to S2, which forwards to S1. So now the node lists on S1 and S2
> contain C1 and the other S node. If the order of the node list on S1 is (S2,C1) and on
> S2 is (S1,C1), then when S1 tries to load a class, it will try S2, then S2 will try S1, but will
> get a fake failure generated, causing S2 not to try more nodes (i.e., C1), and causing S1
> also not to try more nodes.
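
The example above can be sketched as a small simulation. This is a hypothetical model, not Ignite code: the `Node` class, `lookup`, and the `skipCallers` flag are names made up for illustration, timeouts are not modeled, and the only thing taken from the report is the node ordering (S2,C1) on S1 and (S1,C1) on S2. It contrasts faking a failure for a calling node with skipping that node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the lookup loop; none of these names are Ignite APIs. */
final class CycleCheckSketch {
    /** Simplified stand-in for a cluster node. */
    static final class Node {
        final String name;
        final boolean hasClass;
        final List<Node> participants = new ArrayList<>();

        Node(String name, boolean hasClass) {
            this.name = name;
            this.hasClass = hasClass;
        }
    }

    /**
     * Looks for the class on 'node', then on its participants.
     * 'callers' holds the nodes that originated or forwarded the request.
     */
    static boolean lookup(Node node, Set<String> callers, boolean skipCallers) {
        if (node.hasClass)
            return true;

        Set<String> chain = new HashSet<>(callers);
        chain.add(node.name);

        for (Node dest : node.participants) {
            if (chain.contains(dest.name)) {
                if (skipCallers)
                    continue; // Proposed behavior: skip the calling node.

                return false; // Reported behavior: the faked failure ends the loop.
            }

            // The first real response, success or failure, ends the loop.
            return lookup(dest, chain, skipCallers);
        }

        return false;
    }

    /** Builds the C1/S1/S2 scenario and asks S1 to load a class that only C1 has. */
    static boolean s1CanLoad(boolean skipCallers) {
        Node c1 = new Node("C1", true);
        Node s1 = new Node("S1", false);
        Node s2 = new Node("S2", false);

        s1.participants.addAll(Arrays.asList(s2, c1)); // order (S2,C1)
        s2.participants.addAll(Arrays.asList(s1, c1)); // order (S1,C1)

        return lookup(s1, new HashSet<>(), skipCallers);
    }

    public static void main(String[] args) {
        System.out.println("fail on calling node: " + s1CanLoad(false));
        System.out.println("skip calling node:    " + s1CanLoad(true));
    }
}
```

In this model the first variant never reaches C1, because S2 stops at S1 and S1 stops at S2's failure; the second variant lets S2 move past S1 and find the class on C1.
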
> The other flaw is the assumption that all participants have equal access to the resource.
> Assume S1 knows about userVersion1 via S3 and S4, with S3 through C1 and S4 through C2.
> If C2 fails, then S4 is not capable of getting back to a master, but S1 has no way of knowing
> that.
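
The second flaw can be sketched the same way. Again this is a hypothetical model rather than Ignite code: `Node`, `lookup`, and the `continueOnFailure` flag are illustrative names, and the participant order (S4,S3) on S1 is an assumption chosen to expose the problem. The cycle check is left out because this topology has no cycles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the caller loop; none of these names are Ignite APIs. */
final class ContinueOnFailureSketch {
    /** Simplified stand-in for a cluster node. */
    static final class Node {
        final String name;
        final boolean hasClass;
        final boolean online;
        final List<Node> participants = new ArrayList<>();

        Node(String name, boolean hasClass, boolean online) {
            this.name = name;
            this.hasClass = hasClass;
            this.online = online;
        }
    }

    /** Looks for the class on 'node', then on its online participants. */
    static boolean lookup(Node node, boolean continueOnFailure) {
        if (node.hasClass)
            return true;

        for (Node dest : node.participants) {
            if (!dest.online)
                continue; // A timeout is treated as an offline node.

            if (lookup(dest, continueOnFailure))
                return true;

            if (!continueOnFailure)
                return false; // Reported behavior: one failure ends the search.
        }

        return false;
    }

    /** Builds the S1/S3/S4 scenario with C2 failed and asks S1 to load the class. */
    static boolean s1CanLoad(boolean continueOnFailure) {
        Node c1 = new Node("C1", true, true);
        Node c2 = new Node("C2", true, false); // C2 has failed.
        Node s3 = new Node("S3", false, true);
        Node s4 = new Node("S4", false, true);
        Node s1 = new Node("S1", false, true);

        s3.participants.add(c1);
        s4.participants.add(c2);
        s1.participants.addAll(Arrays.asList(s4, s3)); // assumed order (S4,S3)

        return lookup(s1, continueOnFailure);
    }

    public static void main(String[] args) {
        System.out.println("stop on first failure: " + s1CanLoad(false));
        System.out.println("continue on failure:   " + s1CanLoad(true));
    }
}
```

With the assumed ordering, stopping at S4's failure hides the working path through S3 and C1, while continuing to the next participant finds it.
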



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
