chukwa-dev mailing list archives

From "Bill Graham (JIRA)" <j...@apache.org>
Subject [jira] Updated: (CHUKWA-533) Improve fault-tolerance of collectors.
Date Sat, 20 Nov 2010 06:00:54 GMT

     [ https://issues.apache.org/jira/browse/CHUKWA-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham updated CHUKWA-533:
-------------------------------

    Attachment: CHUKWA-533-1.patch

Here's a first pass at a patch for review. I've changed the {{rotate}} and {{add}} methods
to be more fault-tolerant (i.e. able to survive a temporary HDFS outage). The {{init}}
method still requires HDFS, so HDFS must be running for the collector to start. We can revisit
this decision if people see the need.

I changed {{add}} to return {{COMMIT_FAIL}} if the chunks couldn't be added to the sequence
file and I don't update the {{dataSize}} and {{bytesThisRotate}} unless the sequence file
append succeeds. The {{ServletCollector}} returns a 503 if this method returns {{COMMIT_FAIL}}.
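
As a rough sketch of the {{add}} behavior described above (this is not the code from the patch): the names {{FaultTolerantWriterSketch}}, {{ChunkSink}}, {{COMMIT_OK}} and {{httpStatusFor}} are hypothetical stand-ins, and the sequence file is reduced to a single {{append}} call that may throw.

```java
import java.io.IOException;
import java.util.List;

// Illustrative only: a simplified model of the add() change, not the real writer.
class FaultTolerantWriterSketch {

    // Hypothetical status values; the message above names only COMMIT_FAIL.
    enum CommitStatus { COMMIT_OK, COMMIT_FAIL }

    // Hypothetical stand-in for the HDFS-backed sequence file.
    interface ChunkSink {
        void append(byte[] chunk) throws IOException;
    }

    private final ChunkSink sink;
    long dataSize = 0;
    long bytesThisRotate = 0;

    FaultTolerantWriterSketch(ChunkSink sink) {
        this.sink = sink;
    }

    CommitStatus add(List<byte[]> chunks) {
        long written = 0;
        try {
            for (byte[] chunk : chunks) {
                sink.append(chunk);
                written += chunk.length;
            }
        } catch (IOException e) {
            // Report the failure to the caller instead of letting the process die.
            // The counters are left untouched, even if some chunks were appended
            // before the error (a simplification in this sketch).
            System.err.println("append failed: " + e.getMessage());
            return CommitStatus.COMMIT_FAIL;
        }
        // Only count the bytes once the whole append has succeeded.
        dataSize += written;
        bytesThisRotate += written;
        return CommitStatus.COMMIT_OK;
    }

    // How a servlet might map the result to an HTTP status:
    // 503 (Service Unavailable) on COMMIT_FAIL, 200 otherwise.
    static int httpStatusFor(CommitStatus status) {
        return status == CommitStatus.COMMIT_FAIL ? 503 : 200;
    }
}
```

With a sink that throws {{IOException}}, {{add}} returns {{COMMIT_FAIL}}, both counters stay at zero, and the caller can turn the result into a 503 so the agent knows to resend.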

I changed {{rotate}} to log and swallow the error instead of letting it kill the collector.
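
The {{rotate}} change could look roughly like the sketch below. It is not taken from the patch: {{RotateSketch}}, {{FileRotator}}, {{closeAndRename}} and the {{failedRotations}} counter are hypothetical names used only to show the log-and-swallow pattern.

```java
import java.io.IOException;

// Illustrative only: log-and-swallow handling around a rotation step.
class RotateSketch {

    // Hypothetical stand-in for closing the current file and opening the next one.
    interface FileRotator {
        void closeAndRename() throws IOException;
    }

    private final FileRotator rotator;
    int failedRotations = 0; // hypothetical counter, handy for monitoring

    RotateSketch(FileRotator rotator) {
        this.rotator = rotator;
    }

    // Returns true if the rotation succeeded, false if the error was swallowed.
    boolean rotate() {
        try {
            rotator.closeAndRename();
            return true;
        } catch (IOException e) {
            // Log the failure and carry on; the collector process stays up.
            failedRotations++;
            System.err.println("rotate failed: " + e.getMessage());
            return false;
        }
    }
}
```

The trade-off is that a failed rotation is only visible in the logs (or a counter like the one assumed here), so the file stays unrotated until a later attempt succeeds.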

I changed {{ServletCollector}} to not update stats if it gets a {{COMMIT_FAIL}} response.

The only issue that I see with this approach is that if the agent sends chunks and gets back
commit-pending acks for those chunks, HDFS can still go down and the file will not be rotated.
This is the same as the current behavior, though, except that now the collector won't die. If
guaranteed writes are desired, then the {{AsyncAckSender}} should be used.

> Improve fault-tolerance of collectors.
> --------------------------------------
>
>                 Key: CHUKWA-533
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-533
>             Project: Chukwa
>          Issue Type: Improvement
>          Components: data collection
>            Reporter: Bill Graham
>         Attachments: CHUKWA-533-1.patch
>
>
> There are currently a number of ways that a collector can die, typically due to errors
> on a DN or a NN that's being restarted. A collector should have some combination of retry
> logic followed by failing back to the agent, but the collector process should not die.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

