ignite-issues mailing list archives

From "Dmitriy Setrakyan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-752) Speed up failure detection
Date Fri, 24 Jul 2015 09:57:05 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640227#comment-14640227 ]

Dmitriy Setrakyan commented on IGNITE-752:

I looked at the code and have some questions:

# I am not sure {{failureDetectionThreshold}} is the right name. Wouldn't {{failureDetectionTimeout}}
make more sense?
# I tried to read the javadoc on {{IgniteConfiguration}}, but I think it is trying to say
too much. How about it just briefly explains what the setting does, without confusing
users with an explanation of how the implementation works? For example:
Failure detection timeout is used to determine how long the communication or discovery SPIs
should wait before considering a remote connection failed.
# Then in the SPI javadocs for communication and discovery, you can say:
{{failureDetectionTimeout}} automatically controls the following parameters: a, b, c, d. If
any of those parameters is set explicitly, then the {{failureDetectionTimeout}} setting will
be ignored.
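
To illustrate the precedence rule suggested in the last point, here is a minimal Java sketch. It is not Ignite's SPI code: the class {{SpiSettings}}, its methods, and the default values are hypothetical, and using the timeout directly as the bound for each parameter is an assumption made only to keep the example short.

```java
/** Hypothetical sketch of the proposed precedence rule; not Ignite's actual SPI code. */
public class FailureDetectionSketch {
    // Assumed per-parameter defaults, chosen only for illustration.
    static final long DFLT_SOCK_TIMEOUT = 5_000L;
    static final long DFLT_ACK_TIMEOUT = 5_000L;

    /** Hypothetical stand-in for an SPI whose timeouts may be set explicitly. */
    static final class SpiSettings {
        private Long sockTimeout; // null means "not set explicitly"
        private Long ackTimeout;  // null means "not set explicitly"

        SpiSettings setSockTimeout(long ms) { sockTimeout = ms; return this; }

        SpiSettings setAckTimeout(long ms) { ackTimeout = ms; return this; }

        /** The single timeout applies only while no covered parameter is set explicitly. */
        boolean usesFailureDetectionTimeout() {
            return sockTimeout == null && ackTimeout == null;
        }

        long effectiveSockTimeout(long failureDetectionTimeout) {
            if (usesFailureDetectionTimeout())
                return failureDetectionTimeout;
            // The single timeout is ignored: explicit value, else the parameter's own default.
            return sockTimeout != null ? sockTimeout : DFLT_SOCK_TIMEOUT;
        }

        long effectiveAckTimeout(long failureDetectionTimeout) {
            if (usesFailureDetectionTimeout())
                return failureDetectionTimeout;
            return ackTimeout != null ? ackTimeout : DFLT_ACK_TIMEOUT;
        }
    }

    public static void main(String[] args) {
        long failureDetectionTimeout = 10_000L;

        SpiSettings auto = new SpiSettings();
        SpiSettings manual = new SpiSettings().setAckTimeout(2_000L);

        System.out.println(auto.effectiveSockTimeout(failureDetectionTimeout));   // prints 10000
        System.out.println(manual.effectiveSockTimeout(failureDetectionTimeout)); // prints 5000
        System.out.println(manual.effectiveAckTimeout(failureDetectionTimeout));  // prints 2000
    }
}
```

In the second case, setting {{ackTimeout}} explicitly switches the whole SPI back to per-parameter values, so {{sockTimeout}} falls back to its own default rather than to the single timeout.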

> Speed up failure detection
> --------------------------
>                 Key: IGNITE-752
>                 URL: https://issues.apache.org/jira/browse/IGNITE-752
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Yakov Zhdanov
>            Assignee: Denis Magda
>            Priority: Blocker
>             Fix For: sprint-7
>         Attachments: 882.patch, ignite-752.patch
> I think we can (1) make grid configuration significantly easier and (2) speed up failure detection.
> Here are disco SPI configuration properties which are responsible for failure detection:
> # reconnectCount,
> # sockTimeout,
> # networkTimeout,
> # ackTimeout,
> # maxAckTimeout,
> # heartbeatFrequency,
> # maxMissedHeartbeats
> Same for communication SPI
> # reconnectCount, 
> # maxConnTimeout, 
> # connTimeout
> So, we have 10 or even more properties.
> We did it to address the half-open sockets problem (which is pretty common in cloud environments)
> and GC pauses which may happen on cluster nodes - we can increase ack timeouts to prevent
> them from being kicked off the topology.
> By setting values for these props I set the timeout for failure detection. Why do we need
> such a great number of parameters instead of having one on IgniteConfiguration - nodeResponseThreshold
> (or failureDetectionThreshold - can anyone propose a better name?).
> All other parameters will be calculated automatically (I think the user can still set some
> of them for full control over the situation - need to decide if this is needed.)

