chukwa-dev mailing list archives

From "Eric Yang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CHUKWA-369) proposed reliability mechanism
Date Thu, 06 Aug 2009 01:21:14 GMT

    [ https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739857#action_12739857 ]

Eric Yang commented on CHUKWA-369:
----------------------------------

The HTTP return code should be the only contract between agent and collector.  If the HTTP
return code is 200, the collector is responsible for the data from that point on.  An async
status check from the agent to the collector will only complicate things, because the collector
could be busy and unable to answer the status check request.  That creates a domino effect:
the agent resends chunks, and the next status check may fail again, more than once, on the
same busy collector.
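
A minimal sketch of that contract from the agent side, assuming a hypothetical AgentSender
class and a post function that returns the HTTP status code (neither name is from the Chukwa
code base).  The checkpoint advances only on a 200; any other status, or no response at all,
leaves it in place so the same data is resent later.

```java
import java.util.function.ToIntFunction;

/** Hypothetical agent-side sender: HTTP 200 is the only signal that the collector owns the data. */
class AgentSender {
    /** Number of bytes the collector has acknowledged with HTTP 200. */
    private long checkpoint = 0;

    long checkpoint() {
        return checkpoint;
    }

    /**
     * Sends one chunk through {@code post}, a stand-in for the real HTTP call
     * that returns the status code. Returns true only if the collector answered 200.
     */
    boolean send(byte[] chunk, ToIntFunction<byte[]> post) {
        int status;
        try {
            status = post.applyAsInt(chunk);
        } catch (RuntimeException e) {
            // No HTTP code came back: keep the checkpoint so the chunk is resent.
            return false;
        }
        if (status == 200) {
            // The collector now manages this data; the agent may move on.
            checkpoint += chunk.length;
            return true;
        }
        // Any other code means the agent still owns the chunk.
        return false;
    }
}
```

With this shape there is no separate status check request to fail; the only question the
agent asks is whether the POST itself returned 200.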

To summarize, there are 3 ways to solve the problem.

1. A synchronous pipeline.  (The agent writes 1 minute's worth of data, then waits for the
collector to close the file after the 1 minute mark and return the HTTP code.)  If the collector
does not close the file properly, no HTTP code is returned, and the agent resends the 1 minute's
worth of data (or everything since the last checkpoint).  This depends on HDFS IO performance;
previous experience with 0.18 and 0.20 yielded around 20MB/s.

2. An asynchronous pipeline.  It is difficult to track the progress of each agent with the
collectors, and there is a lot of memory overhead to keep track of agent status inside the
collector.  Status check requests may go unanswered and cause frequent retransmission.

3. Use localWriter to write data on the collector node first, and upload the data to HDFS
asynchronously.  The downside is that the collector disk is stressed, and wear and tear on
the collector disk could result in bad data being injected into HDFS without a CRC check.
Collector disk crash = data loss.
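
Option 1 can be sketched from the collector side as follows.  This is only an illustration:
SyncCollector and handlePost are hypothetical names, and a plain OutputStream stands in for
the HDFS file.  The point is that 200 is returned only after the file has been closed.  The
sketch reports a failed close as a 500, which is an assumption; the description above has the
collector return no code at all, and either way the agent sees no 200 and resends.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical collector-side handler for the synchronous pipeline (option 1). */
final class SyncCollector {
    private SyncCollector() {}

    /**
     * Writes the agent's batch to {@code sink} and closes it before answering.
     * Returns the HTTP status code the collector would send back to the agent.
     */
    static int handlePost(byte[] batch, OutputStream sink) {
        // try-with-resources closes the sink; a failure in write() or close()
        // lands in the catch block, so 200 is never returned for an unclosed file.
        try (OutputStream out = sink) {
            out.write(batch);
        } catch (IOException e) {
            return 500; // assumption: any non-200 makes the agent resend
        }
        return 200; // file closed cleanly: the collector now owns the data
    }
}
```

The cost of this design is that the agent's request stays open until the close completes,
which is why throughput depends directly on HDFS IO performance.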

There is really no perfect solution here, but option 1 is the least error prone.  As long as
Hadoop improves performance, Chukwa benefits too.



> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, quite,
> since we don't handle collector crashes.  Here's a proposed reliability mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

