chukwa-dev mailing list archives

From "Bill Graham (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CHUKWA-533) Improve fault-tolerance of collectors.
Date Tue, 12 Oct 2010 18:01:46 GMT

    [ https://issues.apache.org/jira/browse/CHUKWA-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920283#action_12920283 ]

Bill Graham commented on CHUKWA-533:
------------------------------------

Examples from the logs when the NameNode (NN) is unexpectedly rebooted:

- From an active collector taking traffic:
{noformat}
2010-10-12 04:05:13,721 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:2,numberchunks:105
2010-10-12 04:05:15,508 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=24724 dataRate=823
2010-10-12 04:05:45,515 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:05:46,894 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
2010-10-12 04:05:59,899 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
2010-10-12 04:06:03,903 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 1 time(s).
2010-10-12 04:06:07,502 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 2 time(s).
2010-10-12 04:06:11,506 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 3 time(s).
2010-10-12 04:06:13,733 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-10-12 04:06:15,509 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 4 time(s).
2010-10-12 04:06:15,521 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:06:19,512 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 5 time(s).
2010-10-12 04:06:23,517 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 6 time(s).
2010-10-12 04:06:27,521 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 7 time(s).
2010-10-12 04:06:31,525 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 8 time(s).
2010-10-12 04:06:35,529 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 9 time(s).
2010-10-12 04:06:38,534 WARN LeaseChecker DFSClient - Problem renewing lease for DFSClient_-1129462781
2010-10-12 04:06:43,545 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
2010-10-12 04:06:45,527 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:06:47,550 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 1 time(s).
2010-10-12 04:06:51,553 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 2 time(s).
2010-10-12 04:06:55,556 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 3 time(s).
2010-10-12 04:06:59,215 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 4 time(s).
2010-10-12 04:07:03,219 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 5 time(s).
2010-10-12 04:07:07,222 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 6 time(s).
2010-10-12 04:07:11,225 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 7 time(s).
2010-10-12 04:07:13,746 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-10-12 04:07:15,230 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 8 time(s).
2010-10-12 04:07:15,534 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:07:19,235 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 9 time(s).
2010-10-12 04:07:22,237 WARN LeaseChecker DFSClient - Problem renewing lease for DFSClient_-1129462781
2010-10-12 04:07:27,242 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
2010-10-12 04:07:31,246 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 1 time(s).
2010-10-12 04:07:35,251 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 2 time(s).
2010-10-12 04:07:39,254 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 3 time(s).
2010-10-12 04:07:43,258 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 4 time(s).
2010-10-12 04:07:45,541 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:07:47,261 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 5 time(s).
{noformat}

- From an idle collector that got traffic as soon as the active collector died:
{noformat}
2010-10-12 04:10:33,690 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:11:02,165 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-10-12 04:11:03,688 WARN Timer-196 SeqFileWriter - Got an exception in rotate
2010-10-12 04:11:03,688 WARN LeaseChecker DFSClient - Problem renewing lease for DFSClient_23442132
2010-10-12 04:11:03,693 FATAL Timer-196 SeqFileWriter - IO Exception in rotate. Exiting!
2010-10-12 04:11:03,696 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-10-12 04:11:03,697 WARN Shutdown SeqFileWriter - cannot rename dataSink file:/chukwa/logs/201012035922632_c18rbhadoopwkrr10n1cnetcom_4435f4d212b9ca438d77e7e.chukwa
{noformat}

> Improve fault-tolerance of collectors.
> --------------------------------------
>
>                 Key: CHUKWA-533
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-533
>             Project: Chukwa
>          Issue Type: Improvement
>          Components: data collection
>            Reporter: Bill Graham
>
> There are currently a number of ways that a collector can die, typically due to errors
> on a DN or a NN that's being restarted. A collector should have some combination of retry
> logic followed by failing back to the agent, but the collector process should not die.
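
To make the proposed behavior concrete, here is a minimal sketch of "retry, then fail back to the agent, but never exit". It is written in Python for brevity and is not Chukwa code. The names `write_with_retry`, `handle_post`, and `WriterUnavailable`, the retry count, and the use of a 503 status to tell the agent to resend are all assumptions made for illustration.

```python
import time


class WriterUnavailable(Exception):
    """Hypothetical error: the sink stayed unreachable after all retries."""


def write_with_retry(write, chunk, max_attempts=3, delay_s=1.0, sleep=time.sleep):
    """Call write(chunk) up to max_attempts times, pausing between failures.

    Returns the attempt number that succeeded. Raises WriterUnavailable
    instead of terminating the process when every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            write(chunk)
            return attempt
        except OSError as err:  # stand-in for an HDFS IOException
            last_error = err
            if attempt < max_attempts:
                sleep(delay_s)
    raise WriterUnavailable(f"gave up after {max_attempts} attempts") from last_error


def handle_post(write, chunk, **retry_opts):
    """Hypothetical request handler: report the outcome as an HTTP-style status."""
    try:
        write_with_retry(write, chunk, **retry_opts)
        return 200
    except WriterUnavailable:
        # Assumption: a non-2xx reply leaves the chunk with the agent,
        # which can resend it later or to another collector.
        return 503
```

The point of the sketch is that a write failure surfaces as an error the request handler can report, while the collector process keeps running and can serve traffic again once the NN is back.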

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

