chukwa-dev mailing list archives

From "Bill Graham (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CHUKWA-487) Collector left in a bad state after temprorary NN outage
Date Tue, 11 May 2010 02:49:29 GMT

    [ https://issues.apache.org/jira/browse/CHUKWA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866033#action_12866033 ]

Bill Graham commented on CHUKWA-487:
------------------------------------

Actually, looking closer, I can't say for sure that I had data loss. It could just be that
the NN bounce made the file temporarily unavailable. In my case it appears that the file
could not be closed or rotated because the NN had gone down.

The only way you could have data loss, AFAIK, would be if the current data dir, including
the edit log, got corrupted after the last SNN checkpoint. I think this is rare enough not
to worry about. My concern was more about how to make sure the collector isn't left in a bad
state if part of an unclosed file was lost.

The crash and reboot scenario is better than what we have now, but a self-recovering solution
would be ideal. That way, if the NN crashed unexpectedly (perhaps outside business hours),
the collectors wouldn't all need to be restarted. Again though, this is probably a rare occurrence.
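
To make the self-recovering idea concrete, here is a rough sketch of a retry loop with
capped exponential backoff. It is not Chukwa code: RetryingWriter, withRetry, and
RetriesExhaustedException are hypothetical names, and the operation passed in stands for
whatever HDFS write or file rotation the collector performs. The intent is that a temporary
NN outage is ridden out by retrying, and a prolonged one ends in a single clear failure that
the caller can turn into a complete shutdown rather than a half-dead process.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of Chukwa: retries an operation with capped backoff. */
final class RetryingWriter {

    /** Thrown once every attempt has failed, so the caller can shut down completely. */
    public static class RetriesExhaustedException extends Exception {
        public RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs op until it succeeds or maxAttempts attempts have failed.
     * The delay between attempts doubles after each failure, capped at maxDelayMs.
     */
    public static <T> T withRetry(Callable<T> op, int maxAttempts,
                                  long initialDelayMs, long maxDelayMs)
            throws RetriesExhaustedException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Do not swallow interruption; let the caller stop promptly.
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay = Math.min(delay * 2, maxDelayMs);
                }
            }
        }
        throw new RetriesExhaustedException(
                "gave up after " + maxAttempts + " attempts", last);
    }
}
```

A caller would wrap the write in withRetry and, on RetriesExhaustedException, stop all
threads and exit, so that the pid file and the process state stay consistent.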

> Collector left in a bad state after temprorary NN outage
> --------------------------------------------------------
>
>                 Key: CHUKWA-487
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-487
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>    Affects Versions: 0.4.0
>            Reporter: Bill Graham
>            Priority: Blocker
>
> When the name node returns errors to the collector, at some point the collector dies
> halfway. This behavior should be changed to either resemble the agents and keep trying, or
> to shut down completely. Instead, what I'm seeing is that the collector logs that it's shutting
> down, and the var/pidDir/Collector.pid file gets removed, but the collector continues to run,
> albeit not handling new data. This log entry is then repeated ad infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

