chukwa-dev mailing list archives

From "Eric Yang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CHUKWA-369) proposed reliability mechanism
Date Wed, 05 Aug 2009 16:59:14 GMT

    [ https://issues.apache.org/jira/browse/CHUKWA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739607#action_12739607 ]

Eric Yang commented on CHUKWA-369:
----------------------------------

There was an earlier attempt to address this issue, and I think we can learn something from
it.  The main lesson was that *flushing on every write while holding the client connection
open yields poor collector performance.*

LocalWriter was designed to handle this problem.  Instead of writing to HDFS directly, it
writes to the local file system and then puts the file onto HDFS.  The main idea was to
decouple the agent, collector, and data node from one another while improving data
reliability.  When the collector crashes, it resumes processing from its local disk.
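
To make the idea concrete, here is a minimal sketch of the write-locally-then-upload pattern.
It is not the LocalWriter code.  The function names, the ".tmp"/".done" suffixes, and the use
of a second local directory as a stand-in for an HDFS put are all assumptions made for
illustration.

```python
import os
import shutil
import tempfile


def spool_chunk(spool_dir, name, data):
    """Write one chunk to local disk, then mark it complete with a rename."""
    tmp_path = os.path.join(spool_dir, name + ".tmp")
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # data is on disk before we acknowledge it
    done_path = os.path.join(spool_dir, name + ".done")
    os.replace(tmp_path, done_path)  # atomic rename within one filesystem
    return done_path


def drain(spool_dir, upload):
    """Upload every completed file in name order; delete only after upload returns."""
    uploaded = []
    for fname in sorted(os.listdir(spool_dir)):
        if not fname.endswith(".done"):
            continue  # partial .tmp files left by a crash are skipped
        path = os.path.join(spool_dir, fname)
        upload(path)
        os.remove(path)
        uploaded.append(fname)
    return uploaded


def make_dir_uploader(remote_dir):
    """Hypothetical stand-in for an HDFS put: copy into another local directory."""
    def upload(path):
        shutil.copy(path, os.path.join(remote_dir, os.path.basename(path)))
    return upload
```

Because drain() only looks at what is on disk, calling it again after a restart picks up
whatever was spooled before the crash.  A file that was uploaded but not yet deleted would be
uploaded twice, so this sketch gives at-least-once delivery, not exactly-once.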

LocalWriter was never finished, but it showed some promising ideas for addressing the
reliability problem.  Its major flaw was that writing to local disk was faster than writing
to HDFS, so the collector's disk filled up frequently.  The implementation could be improved
by limiting local disk usage and refusing additional chunks once the disk queue reaches its
quota.  This should improve collector reliability without using the synchronized pipeline.
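
The quota idea can be sketched as a small byte counter that the collector consults before
accepting a chunk.  This is only an illustration under assumed names (SpoolQuota,
handle_chunk, and the "accepted"/"rejected" results are hypothetical), not a proposed patch.

```python
import threading


class SpoolQuota:
    """Tracks bytes waiting on local disk and refuses chunks beyond a quota."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self._lock = threading.Lock()

    def try_reserve(self, nbytes):
        """Reserve space for a chunk; return False if it would exceed the quota."""
        with self._lock:
            if self.used_bytes + nbytes > self.quota_bytes:
                return False
            self.used_bytes += nbytes
            return True

    def release(self, nbytes):
        """Give space back once the data has been moved off local disk."""
        with self._lock:
            self.used_bytes = max(0, self.used_bytes - nbytes)


def handle_chunk(quota, chunk, write_local):
    """Accept a chunk only if the local queue has room; otherwise push back."""
    if not quota.try_reserve(len(chunk)):
        return "rejected"  # the sender is expected to retry later
    write_local(chunk)
    return "accepted"
```

With a quota of 10 bytes, a 6-byte chunk is accepted, a second 6-byte chunk is rejected, and
it is accepted again after release(6) is called for the first one.  Rejecting early pushes
the backlog back to the sender instead of letting the local disk fill up.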


> proposed reliability mechanism
> ------------------------------
>
>                 Key: CHUKWA-369
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-369
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: data collection
>    Affects Versions: 0.3.0
>            Reporter: Ari Rabkin
>             Fix For: 0.3.0
>
>
> We like to say that Chukwa is a system for reliable log collection. It isn't, quite,
> since we don't handle collector crashes.  Here's a proposed reliability mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

