cassandra-commits mailing list archives

From "Jackson Chung (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3273) FailureDetector can take a very long time to mark a host down
Date Fri, 30 Sep 2011 00:21:45 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117772#comment-13117772 ]

Jackson Chung commented on CASSANDRA-3273:
------------------------------------------

Smoke test only: so far so good.

I have a node (in a 6-node cluster) that was down for a LONG time (PHI around 700). I started
that node for about 30 seconds before stopping it again.

The ring shows that node as down after about 20-30 seconds, give or take.

{noformat}
TRACE [GossipTasks:1] 2011-09-30 00:14:58,727 FailureDetector.java (line 156) PHI for /10.40.22.186 : 703.9568334429565
TRACE [GossipTasks:1] 2011-09-30 00:14:58,727 FailureDetector.java (line 160) notifying listeners that /10.40.22.186 is down
TRACE [GossipTasks:1] 2011-09-30 00:14:58,727 FailureDetector.java (line 161) intervals: 1027.0 1904.0 2153.0 951.0 215.0 1788.0 1002.0 1002.0 895.0 1133.0 1869.0 mean: 1267.1818181818182
DEBUG [GossipStage:1] 2011-09-30 00:14:58,728 Gossiper.java (line 661) Clearing interval times for /10.40.22.186 due to generation change
DEBUG [GossipStage:1] 2011-09-30 00:14:58,728 FailureDetector.java (line 242) Ignoring interval time of 2054002.0
TRACE [GossipTasks:1] 2011-09-30 00:14:59,729 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.0
TRACE [GossipTasks:1] 2011-09-30 00:15:00,730 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.43429448190325176
TRACE [GossipTasks:1] 2011-09-30 00:15:01,732 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.8690228244277856
TRACE [GossipTasks:1] 2011-09-30 00:15:02,733 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.2890479080886867
TRACE [GossipTasks:1] 2011-09-30 00:15:03,734 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.19662520906271305
TRACE [GossipTasks:1] 2011-09-30 00:15:04,735 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.20189636121957935
TRACE [GossipTasks:1] 2011-09-30 00:15:05,737 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.5977870734348798
TRACE [GossipTasks:1] 2011-09-30 00:15:06,738 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.20802340729819624
TRACE [GossipTasks:1] 2011-09-30 00:15:07,739 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.6139326289463335
TRACE [GossipTasks:1] 2011-09-30 00:15:08,740 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.21152308625862737
TRACE [GossipTasks:1] 2011-09-30 00:15:09,741 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.21261773854488178
TRACE [GossipTasks:1] 2011-09-30 00:15:10,743 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.6270982327510521
TRACE [GossipTasks:1] 2011-09-30 00:15:11,744 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.1968065146773795
TRACE [GossipTasks:1] 2011-09-30 00:15:12,745 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.579337235438655
TRACE [GossipTasks:1] 2011-09-30 00:15:13,746 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.37217274142982526
TRACE [GossipTasks:1] 2011-09-30 00:15:14,747 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.7443454828596505
TRACE [GossipTasks:1] 2011-09-30 00:15:15,757 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.3555955505071756
TRACE [GossipTasks:1] 2011-09-30 00:15:16,758 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.7083717111193488
TRACE [GossipTasks:1] 2011-09-30 00:15:17,759 FailureDetector.java (line 156) PHI for /10.40.22.186 : 1.061147871731522
TRACE [GossipTasks:1] 2011-09-30 00:15:18,760 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.3194684082936909
TRACE [GossipTasks:1] 2011-09-30 00:15:19,762 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.6395757534039692
TRACE [GossipTasks:1] 2011-09-30 00:15:20,763 FailureDetector.java (line 156) PHI for /10.40.22.186 : 0.9593636301059537
TRACE [GossipTasks:1] 2011-09-30 00:15:21,764 FailureDetector.java (line 156) PHI for /10.40.22.186 : 1.2791515068079384
TRACE [GossipTasks:1] 2011-09-30 00:15:22,765 FailureDetector.java (line 156) PHI for /10.40.22.186 : 1.598939383509923
TRACE [GossipTasks:1] 2011-09-30 00:15:23,767 FailureDetector.java (line 156) PHI for /10.40.22.186 : 1.919046728620201
TRACE [GossipTasks:1] 2011-09-30 00:15:24,768 FailureDetector.java (line 156) PHI for /10.40.22.186 : 2.238834605322186
TRACE [GossipTasks:1] 2011-09-30 00:15:25,769 FailureDetector.java (line 156) PHI for /10.40.22.186 : 2.5586224820241705
TRACE [GossipTasks:1] 2011-09-30 00:15:26,771 FailureDetector.java (line 156) PHI for /10.40.22.186 : 2.8787298271344484
TRACE [GossipTasks:1] 2011-09-30 00:15:27,772 FailureDetector.java (line 156) PHI for /10.40.22.186 : 3.198517703836433
TRACE [GossipTasks:1] 2011-09-30 00:15:28,773 FailureDetector.java (line 156) PHI for /10.40.22.186 : 3.518305580538418
TRACE [GossipTasks:1] 2011-09-30 00:15:29,774 FailureDetector.java (line 156) PHI for /10.40.22.186 : 3.838093457240402
TRACE [GossipTasks:1] 2011-09-30 00:15:30,776 FailureDetector.java (line 156) PHI for /10.40.22.186 : 4.158200802350681
TRACE [GossipTasks:1] 2011-09-30 00:15:31,777 FailureDetector.java (line 156) PHI for /10.40.22.186 : 4.4779886790526655
TRACE [GossipTasks:1] 2011-09-30 00:15:32,778 FailureDetector.java (line 156) PHI for /10.40.22.186 : 4.79777655575465
TRACE [GossipTasks:1] 2011-09-30 00:15:33,779 FailureDetector.java (line 156) PHI for /10.40.22.186 : 5.117564432456635
TRACE [GossipTasks:1] 2011-09-30 00:15:34,781 FailureDetector.java (line 156) PHI for /10.40.22.186 : 5.437671777566913
TRACE [GossipTasks:1] 2011-09-30 00:15:35,782 FailureDetector.java (line 156) PHI for /10.40.22.186 : 5.757459654268897
TRACE [GossipTasks:1] 2011-09-30 00:15:36,783 FailureDetector.java (line 156) PHI for /10.40.22.186 : 6.077247530970882
TRACE [GossipTasks:1] 2011-09-30 00:15:37,784 FailureDetector.java (line 156) PHI for /10.40.22.186 : 6.397035407672866
TRACE [GossipTasks:1] 2011-09-30 00:15:38,785 FailureDetector.java (line 156) PHI for /10.40.22.186 : 6.7168232843748505
TRACE [GossipTasks:1] 2011-09-30 00:15:39,786 FailureDetector.java (line 156) PHI for /10.40.22.186 : 7.036611161076836
TRACE [GossipTasks:1] 2011-09-30 00:15:40,788 FailureDetector.java (line 156) PHI for /10.40.22.186 : 7.356718506187114
TRACE [GossipTasks:1] 2011-09-30 00:15:41,789 FailureDetector.java (line 156) PHI for /10.40.22.186 : 7.676506382889099
TRACE [GossipTasks:1] 2011-09-30 00:15:42,790 FailureDetector.java (line 156) PHI for /10.40.22.186 : 7.996294259591083
TRACE [GossipTasks:1] 2011-09-30 00:15:43,791 FailureDetector.java (line 156) PHI for /10.40.22.186 : 8.316082136293067
TRACE [GossipTasks:1] 2011-09-30 00:15:43,792 FailureDetector.java (line 160) notifying listeners that /10.40.22.186 is down
TRACE [GossipTasks:1] 2011-09-30 00:15:43,792 FailureDetector.java (line 161) intervals: 1001.0 2004.0 1011.0 481.0 999.0 1514.0 487.0 1551.0 450.0 1001.0 2002.0 1516.0 2003.0 3012.0 mean: 1359.4285714285713
{noformat}
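
For anyone reading the trace: the PHI values are consistent with an exponential arrival model, where PHI grows linearly with the time since the last heartbeat, scaled by the mean of the recorded intervals. Below is a rough Python sketch of that arithmetic. The function names are made up for illustration (they are not Cassandra's), and the conviction threshold of 8 is an assumption that merely matches the point where the trace reports the node as down.

```python
import math

# Assumed conviction threshold; the trace convicts just after PHI passes 8.
PHI_CONVICT_THRESHOLD = 8.0


def phi(ms_since_last_heartbeat, mean_interval_ms):
    """Suspicion level under an exponential model: -log10(exp(-t / mean))."""
    return ms_since_last_heartbeat / mean_interval_ms * math.log10(math.e)


def ms_until_conviction(mean_interval_ms, threshold=PHI_CONVICT_THRESHOLD):
    """Silence (in ms) needed before phi reaches the threshold."""
    return threshold * mean_interval_ms / math.log10(math.e)


# Intervals from the last "intervals:" line of the trace.
intervals = [1001.0, 2004.0, 1011.0, 481.0, 999.0, 1514.0, 487.0,
             1551.0, 450.0, 1001.0, 2002.0, 1516.0, 2003.0, 3012.0]
mean = sum(intervals) / len(intervals)

print("mean interval (ms):", mean)
print("phi gained per 1001 ms of silence:", phi(1001.0, mean))
print("seconds of silence until conviction:", ms_until_conviction(mean) / 1000.0)
```

With the mean from the trace (about 1359 ms), this works out to roughly 25 seconds of silence before conviction, which lines up with the 20-30 seconds observed on the ring.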
                
> FailureDetector can take a very long time to mark a host down
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-3273
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3273
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 0.8.7
>
>         Attachments: 3273.txt
>
>
> There are two ways to trigger this:
> * Bring a node up very briefly in a mixed-version cluster and then terminate it
> * Bring a node up, terminate it for a very long time, then bring it back up and take it down again
> In the first case, what can happen is a very short interval arrival time is recorded by the versioning logic which requires reconnecting and can happen very quickly. This can easily be solved by rejecting any intervals within a reasonable bound, for instance the gossiper interval.
> The second instance is harder to solve, because what is happening is that an extremely large interval is recorded, which is the time the node was left dead the first time. This throws off the mean of the intervals and causes it to take a much longer time than it should to mark it down the second time.
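
To illustrate the second case, here is a rough Python sketch of how a single huge interval inflates the mean, and how ignoring intervals above an upper bound avoids that. Everything here is an assumption for illustration: the bound `HYPOTHETICAL_MAX_INTERVAL_MS` and the helper names are invented and not taken from the attached patch, the threshold of 8 is assumed, and the sample is just a handful of intervals from the trace in this thread plus the 2054002.0 ms interval it reports as ignored.

```python
import math

# Hypothetical upper bound on an acceptable arrival interval (not from the patch).
HYPOTHETICAL_MAX_INTERVAL_MS = 10000.0


def seconds_to_convict(intervals, threshold=8.0):
    """Silence needed for phi = t / mean * log10(e) to reach the threshold."""
    mean = sum(intervals) / len(intervals)
    return threshold * mean / math.log10(math.e) / 1000.0


def record(intervals, interval_ms, max_interval_ms):
    """Append interval_ms unless it exceeds the bound; return True if kept."""
    if interval_ms > max_interval_ms:
        return False
    intervals.append(interval_ms)
    return True


recent = [1001.0, 2004.0, 1011.0, 481.0, 999.0, 1514.0, 487.0]
downtime_ms = 2054002.0  # how long the node was dead the first time

# Without a bound, the downtime is recorded as an ordinary interval.
polluted = recent + [downtime_ms]

# With a bound, the downtime is rejected and the window is unchanged.
filtered = list(recent)
kept = record(filtered, downtime_ms, HYPOTHETICAL_MAX_INTERVAL_MS)

print("outlier kept:", kept)
print("seconds to convict, outlier recorded:", seconds_to_convict(polluted))
print("seconds to convict, outlier ignored:", seconds_to_convict(filtered))
```

In this sample the outlier pushes the time to conviction from about 20 seconds to well over an hour, which is the kind of delay the description is talking about.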

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
