chukwa-dev mailing list archives

From "Ari Rabkin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CHUKWA-369) proposed reliability mechanism
Date Tue, 04 Aug 2009 19:08:14 GMT

    [ https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739099#action_12739099 ]

Ari Rabkin commented on CHUKWA-369:
-----------------------------------

Right now, collectors just blindly send back OK after every chunk, even if the data isn't
stable on disk.  The OK is sent *after* the data is handed to a Writer, and therefore after
Writer.add() returns.  But Writer.add() is void, and so we get no verification that the write
committed.

I'd like to have Writer.add() return one of two things: either an OK, or else a "witness
string", which gets passed back to the client.  "OK" means that the data is now the
collector's responsibility, and the agent should advance its checkpointed state.
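
To make the two-outcome return value concrete, here is a rough Java sketch. Everything in it is hypothetical: AddResult, Witness, responseFor() and the "file length" response format are names and choices made up for illustration, not existing Chukwa code, and the path in main() is a placeholder.

```java
import java.util.Optional;

// Hypothetical sketch; none of these types exist in Chukwa.
public class AddResultSketch {

    /** A witness: an HDFS file name plus the length that file must reach. */
    static final class Witness {
        final String file;
        final long length;

        Witness(String file, long length) {
            this.file = file;
            this.length = length;
        }
    }

    /** Possible return value for Writer.add(): either OK, or a witness to relay. */
    static final class AddResult {
        private final Witness witness; // null means OK

        private AddResult(Witness witness) {
            this.witness = witness;
        }

        static AddResult ok() {
            return new AddResult(null);
        }

        static AddResult pending(String file, long length) {
            return new AddResult(new Witness(file, length));
        }

        boolean isOk() {
            return witness == null;
        }

        Optional<Witness> witness() {
            return Optional.ofNullable(witness);
        }
    }

    /** What a collector might send back instead of a blind OK (format is assumed). */
    static String responseFor(AddResult result) {
        if (result.isOk()) {
            return "OK";
        }
        Witness w = result.witness().get();
        return w.file + " " + w.length;
    }

    public static void main(String[] args) {
        System.out.println(responseFor(AddResult.ok()));
        // Placeholder path, for illustration only.
        System.out.println(responseFor(AddResult.pending("/hypothetical/sink.chukwa", 4096L)));
    }
}
```

The point of a small result type rather than a bare string is that the collector can tell "already committed" apart from "will commit later" without parsing anything.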

The witness string is a filename in HDFS and a file length.  Periodically, the agent checks
the length of the file; if it exceeds the specified length, then the data has been committed
to the file, and the agent can again advance its checkpoint.  If the data hasn't committed
within the specified period, then the agent stops all running adaptors, and resumes from the
last checkpoint.
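
The agent-side check could look roughly like the sketch below. It is only an illustration under stated assumptions: Pending, Outcome and WitnessCheckSketch are hypothetical names, the file-length lookup is passed in as a plain function standing in for whatever HDFS call the agent would really make, and pending writes are assumed to commit in the order they were sent. The comparison is strict because the description says "exceeds"; whether >= is more appropriate depends on how the witness length ends up being defined.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch; none of these names are existing Chukwa classes.
public class WitnessCheckSketch {

    /** One unconfirmed write: the witness plus what the agent does with it. */
    static final class Pending {
        final String file;           // HDFS file named by the witness
        final long requiredLength;   // length named by the witness
        final long checkpointOffset; // agent-side offset to record once committed
        final long deadlineMillis;   // give up if not committed by this time

        Pending(String file, long requiredLength, long checkpointOffset, long deadlineMillis) {
            this.file = file;
            this.requiredLength = requiredLength;
            this.checkpointOffset = checkpointOffset;
            this.deadlineMillis = deadlineMillis;
        }
    }

    enum Outcome { WAITING, ADVANCED, RESTART }

    private final Deque<Pending> pending = new ArrayDeque<>();
    private final ToLongFunction<String> fileLength; // stand-in for an HDFS length lookup
    private long checkpoint = 0L;

    WitnessCheckSketch(ToLongFunction<String> fileLength) {
        this.fileLength = fileLength;
    }

    void track(Pending p) {
        pending.addLast(p);
    }

    long checkpoint() {
        return checkpoint;
    }

    /** One periodic pass over the pending writes, oldest first. */
    Outcome check(long nowMillis) {
        Outcome outcome = Outcome.WAITING;
        while (!pending.isEmpty()) {
            Pending head = pending.peekFirst();
            if (fileLength.applyAsLong(head.file) > head.requiredLength) {
                // File has grown past the witness length: treat as committed.
                checkpoint = head.checkpointOffset;
                pending.removeFirst();
                outcome = Outcome.ADVANCED;
            } else if (nowMillis > head.deadlineMillis) {
                // Not committed in time: caller should stop adaptors and
                // resume from the last recorded checkpoint.
                pending.clear();
                return Outcome.RESTART;
            } else {
                break; // oldest write still in flight; later ones cannot be ahead of it
            }
        }
        return outcome;
    }

    public static void main(String[] args) {
        Map<String, Long> lengths = new HashMap<>(); // fake file system for the demo
        WitnessCheckSketch agent = new WitnessCheckSketch(f -> lengths.getOrDefault(f, 0L));
        agent.track(new Pending("/hypothetical/sink.chukwa", 4096L, 4096L, 60_000L));
        System.out.println(agent.check(1_000L));
        lengths.put("/hypothetical/sink.chukwa", 8192L);
        System.out.println(agent.check(2_000L) + " checkpoint=" + agent.checkpoint());
    }
}
```

Stopping at the first write that is still in flight is what makes the in-order commit assumption matter: if commits could land out of order, the agent would have to track each pending write independently.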

This is much easier to implement if we can assume a few things:
1) A single collector will commit data from a single Agent in order.
2) A single agent won't fail over to a new collector unless the previous collector failed:
therefore, even if writes are split across collectors, we're still guaranteed commit-in-order.
3) Collector failures are rare, and therefore agents don't need to update their checkpoints
all that often, and can safely rewind several minutes in the event of failure.

All these assumptions are currently true; I just want to document them and explain clearly
that they can't be violated without breaking the reliability mechanism.

> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, quite,
> since we don't handle collector crashes.  Here's a proposed reliability mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

