chukwa-dev mailing list archives

From "Ari Rabkin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CHUKWA-369) proposed reliability mechanism
Date Wed, 05 Aug 2009 21:11:14 GMT

    [ https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739751#action_12739751 ]

Ari Rabkin commented on CHUKWA-369:
-----------------------------------

@Jerome:

The proposal is as follows:
1) In response to a PUT, the collector returns the filename and position in the sink file where
the data will be written, if it gets written. Since files have exactly one writer, we're guaranteed
that no other writer can write to that offset. And if the write succeeds, it'll be the write
corresponding to that PUT.
2) Some minutes later, the agent asks a collector, any collector, how long the indicated sink
file (or corresponding .done file) is.  If it's greater than the indicated length, the write
succeeded.  
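
As a rough illustration of the check in step 2, here is a minimal Python sketch. The names PutReceipt and write_committed are hypothetical, not Chukwa APIs. It also assumes the agent keeps the byte count of its PUT alongside the returned position, and that "long enough" means the file covers the end of that write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PutReceipt:
    """What the agent keeps after a PUT (hypothetical shape, not a Chukwa type)."""
    sink_file: str  # sink file name returned by the collector
    offset: int     # position in that file where the data would start
    length: int     # number of bytes the agent sent in this PUT


def write_committed(receipt: PutReceipt, reported_length: int) -> bool:
    """Return True if the reported file length covers the whole write.

    reported_length is whatever a collector says the sink file (or the
    corresponding .done file) currently measures, in bytes.
    """
    return reported_length >= receipt.offset + receipt.length


receipt = PutReceipt(sink_file="example.chukwa", offset=1000, length=200)
print(write_committed(receipt, 1200))  # file reaches the end of the write
print(write_committed(receipt, 1100))  # file stops partway through the write
```

Because each sink file has exactly one writer, a file that reaches past the end of the reserved range can only contain the data from that PUT at that position.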

There's one small wrinkle.  
2a) If a .done was created, and then removed by demux or archiving, collectors should continue
to show it as having been written.  There are a couple of ways to do this. For instance, collectors
could also look in the archive input and output dirs, to see if the .done file is there.
And they could remember the .dones they saw previously, on the assumption that if it ever existed,
it's somewhere in the processing pipeline and the data is safe.
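
The "remember the .dones" option could look something like the sketch below. SinkLengthTracker and base_name are made-up names for illustration. The sketch assumes a .done file keeps the base name of the .chukwa file it was renamed from, and it only keeps state in memory, so a collector restart would forget what it saw.

```python
def base_name(name):
    """Strip a trailing .done or .chukwa so both map to one key (assumed naming)."""
    for suffix in (".done", ".chukwa"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


class SinkLengthTracker:
    """Remembers the longest length ever observed for each sink file."""

    def __init__(self):
        self._longest = {}  # base name -> largest length seen, in bytes

    def observe(self, name, length):
        """Record a length seen while listing the sink or archive dirs."""
        key = base_name(name)
        self._longest[key] = max(length, self._longest.get(key, 0))

    def reported_length(self, name):
        """Length to report to an agent, even if the file has since been removed."""
        return self._longest.get(base_name(name), 0)


tracker = SinkLengthTracker()
tracker.observe("example.chukwa", 800)
tracker.observe("example.done", 1200)
# Later, demux removes example.done; the remembered length is still reported.
print(tracker.reported_length("example.chukwa"))
```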

Furthermore, if we go this route, we really ought to do something about "marooned" .chukwa
files.  Right now, if a collector crashes or is stopped, it leaves a .chukwa file in the sink.
And these files never get processed and never get deleted.  Some other collector ought to
rename it and make it available for processing.  This is probably a good thing in general,
but not actually required for the reliability mechanism I'm proposing.
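
For the marooned-file cleanup, a sketch along these lines might do. It is only an illustration: recover_marooned and the idle threshold are assumptions of mine, it uses the local filesystem rather than HDFS, and it treats "not modified for a while" as a stand-in for "the owning collector is gone", which a real implementation would need to decide more carefully.

```python
import os
import time


def recover_marooned(sink_dir, max_idle_seconds, now=None):
    """Rename stale .chukwa files to .done so they become available for processing.

    A file counts as stale if it has not been modified for at least
    max_idle_seconds (a hypothetical policy). Returns the new paths.
    """
    now = time.time() if now is None else now
    recovered = []
    for name in sorted(os.listdir(sink_dir)):
        if not name.endswith(".chukwa"):
            continue
        path = os.path.join(sink_dir, name)
        if now - os.path.getmtime(path) < max_idle_seconds:
            continue  # recently written; some collector may still own it
        done_path = path[: -len(".chukwa")] + ".done"
        os.rename(path, done_path)
        recovered.append(done_path)
    return recovered
```

A collector could run something like this periodically over the sink directory, with the threshold set well above the normal file rotation interval.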

> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, quite,
> since we don't handle collector crashes.  Here's a proposed reliability mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

