incubator-chukwa-user mailing list archives

From Corbin Hoenes
Subject Re: ChukwaArchiveManager and DemuxManager
Date Tue, 02 Feb 2010 19:27:42 GMT
Perfect Bill, exactly what I'm trying to put together to understand the data flow.

On Feb 2, 2010, at 11:56 AM, Bill Graham wrote:

> I had a lot of questions regarding the data flow as well. I spent a while reverse engineering
it and wrote something up on our internal wiki. I believe this is what's happening. If others
with more knowledge could verify what I have below, I'll gladly move it to a wiki on the Chukwa
> Regarding your specific question, I believe the DemuxManager process is the first step
in aggregating the data sink files. It moves the chunks to the dataSinkArchives directory
once it's done with them. The ArchiveManager later archives those chunks.
> Collectors write chunks to logs/*.chukwa files until a 64MB chunk size is reached or
a given time interval is reached.
> to: logs/*.chukwa
> Collectors close chunks and rename them to *.done
> from: logs/*.chukwa
> to: logs/*.done
> DemuxManager wakes up every 20 seconds, runs M/R to merge *.done files and moves them.
> from: logs/*.done
> to: demuxProcessing/mrInput
> to: demuxProcessing/mrOutput
> to: dataSinkArchives/[yyyyMMdd]/*/*.done
> PostProcessManager wakes up every few minutes and aggregates, orders and de-dups record
> from: postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
> to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt
> HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group 5 minute logs to
> from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt
> to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]
> to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
> leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/
> DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs to daily.
> from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt
> to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]
> to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt
> leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/
> ChukwaArchiveManager runs every half hour or so and uses M/R to aggregate and remove
dataSinkArchives data.
> from: dataSinkArchives/[yyyyMMdd]/*/*.done
> to: archivesProcessing/mrInput
> to: archivesProcessing/mrOutput
> to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*
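
To make the repos/ path templates above concrete, here is a minimal Python sketch that expands them for one timestamp. It is illustration only and not Chukwa code: the function names are hypothetical, [N] is treated as an opaque integer, and rounding [mm] down to a 5-minute bucket is an assumption based on the "5 minute logs" wording.

```python
from datetime import datetime

# Hypothetical helpers, not Chukwa APIs. The path layouts are copied from
# the templates in the message; everything else is an assumption.

def five_minute_dir(cluster, data_type, ts):
    """repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]"""
    # Assumption: [mm] is the minute rounded down to a 5-minute bucket.
    mm = (ts.minute // 5) * 5
    return f"repos/{cluster}/{data_type}/{ts:%Y%m%d}/{ts:%H}/{mm:02d}"

def hourly_done_file(cluster, data_type, ts, n):
    """Target of the hourly rolling job, per the template above."""
    return (f"repos/{cluster}/{data_type}/{ts:%Y%m%d}/{ts:%H}/"
            f"{data_type}_HourlyDone_{ts:%Y%m%d}_{ts:%H}.{n}.evt")

def daily_done_file(cluster, data_type, ts, n):
    """Target of the daily rolling job, per the template above."""
    return (f"repos/{cluster}/{data_type}/{ts:%Y%m%d}/"
            f"{data_type}_DailyDone_{ts:%Y%m%d}.{n}.evt")

if __name__ == "__main__":
    # "demo" and "SysLog" are made-up cluster and data type names.
    ts = datetime(2010, 2, 2, 19, 27)
    print(five_minute_dir("demo", "SysLog", ts))
    print(hourly_done_file("demo", "SysLog", ts, 0))
    print(daily_done_file("demo", "SysLog", ts, 0))
```

Keeping each template in its own small function makes it easy to compare the output against what actually shows up in HDFS and to correct any assumption that turns out to be wrong.
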
> thanks,
> Bill
> On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes wrote:
> I am trying to understand the flow of data inside hdfs as it's processed by the data
processor script.
> I see the and are run which runs ArchiveManager and DemuxManager.
  It appears from just reading the code that they both are looking at the data sink (default
> Can someone shed some light on how ArchiveManager and DemuxManager interact?  E.g. I
was under the impression that the data flowed through the archiving process first then got
fed into the demuxing after it had created .arc files.
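
Going by Bill's description in this thread, the order appears to be the reverse of that impression: demux first, archive second. A small Python sketch of that ordering follows. The directory patterns are copied from the thread; the list and function names are hypothetical and exist only for illustration, and the sketch reflects Bill's unverified write-up rather than confirmed Chukwa behavior.

```python
# Illustrative sketch only. Directory patterns are copied from the thread;
# SINK_FILE_STAGES and owners_in_order are hypothetical names, not Chukwa APIs.
SINK_FILE_STAGES = [
    ("Collector", "logs/*.chukwa"),
    ("Collector", "logs/*.done"),
    ("DemuxManager", "demuxProcessing/mrInput"),
    ("DemuxManager", "demuxProcessing/mrOutput"),
    ("DemuxManager", "dataSinkArchives/[yyyyMMdd]/*/*.done"),
    ("ChukwaArchiveManager", "archivesProcessing/mrInput"),
    ("ChukwaArchiveManager", "archivesProcessing/mrOutput"),
    ("ChukwaArchiveManager", "finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*"),
]

def owners_in_order(stages):
    """Return each process once, in the order it first touches the data."""
    seen = []
    for owner, _path in stages:
        if owner not in seen:
            seen.append(owner)
    return seen

if __name__ == "__main__":
    print(" -> ".join(owners_in_order(SINK_FILE_STAGES)))
```

Read this way, dataSinkArchives is the hand-off point: DemuxManager leaves finished chunks there and ChukwaArchiveManager picks them up later.
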
