incubator-chukwa-user mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: ChukwaArchiveManager and DemuxManager
Date Wed, 03 Feb 2010 01:47:58 GMT
Thanks Eric. I added a wiki page, with a link from the main wiki page:

http://wiki.apache.org/hadoop/Chukwa
http://wiki.apache.org/hadoop/Chukwa_Processes_and_Data_Flow

I've added your comments to step 3. Please take a look and feel free to edit
as you see fit. Can you verify the location of the InError directory in step
3? I listed it as dataSinkArchives/InError/[yyyyMMdd]/*/*.done, but I'm not
positive about that.
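
A quick way to double-check on a running cluster would be something like the
following (this assumes the default /chukwa root, and the glob is only my
guess, so treat it as a sketch):

    hadoop fs -ls /chukwa/dataSinkArchives/
    hadoop fs -ls '/chukwa/dataSinkArchives/InError/*/*/*.done'

If the second listing comes back empty after a failed demux, then I have the
InError location wrong in step 3.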

thanks,
Bill

On Tue, Feb 2, 2010 at 4:36 PM, Eric Yang <eyang@yahoo-inc.com> wrote:

> Hi Bill,
>
> I'd love to see this on our wiki.  The description is very close to what is
> happening in Chukwa data processing.  If you could update step 3 with:
>
> - DemuxManager checks for *.done files every 20 seconds.
> - If a *.done file exists, move it to demuxProcessing/mrInput for demux.
> - If demux is successful within 3 attempts, move demuxProcessing/mrOutput to
> dataSinkArchive/[yyyyMMdd]/*/*.done; otherwise move the data to the InError
> directory.
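
For my own notes, I read those three bullets as roughly the shell sketch
below.  It's only a paraphrase, not the actual DemuxManager code: the
run_demux_job helper, the /chukwa root, and especially the InError
destination (the path I'm asking about above) are placeholders.

    # Rough sketch of step 3 only -- not real Chukwa code.
    # Move the closed sink files into the demux staging area.
    hadoop fs -mv '/chukwa/logs/*.done' /chukwa/demuxProcessing/mrInput

    # Try the demux M/R job up to 3 times.
    success=false
    for attempt in 1 2 3; do
      if run_demux_job /chukwa/demuxProcessing/mrInput \
                       /chukwa/demuxProcessing/mrOutput; then
        success=true
        break
      fi
    done

    if [ "$success" = true ]; then
      # Demux output becomes the day's data sink archive.
      hadoop fs -mv '/chukwa/demuxProcessing/mrOutput/*' \
        "/chukwa/dataSinkArchives/$(date +%Y%m%d)/"
    else
      # After 3 failures the input is parked in InError (exact path unverified).
      hadoop fs -mv '/chukwa/demuxProcessing/mrInput/*' \
        "/chukwa/dataSinkArchives/InError/$(date +%Y%m%d)/"
    fi
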
>
> Thanks
>
> Regards,
> Eric
>
> On 2/2/10 10:56 AM, "Bill Graham" <billgraham@gmail.com> wrote:
>
> > I had a lot of questions regarding the data flow as well. I spent a while
> > reverse engineering it and wrote something up on our internal wiki. I believe
> > this is what's happening. If others with more knowledge could verify what I
> > have below, I'll gladly move it to a wiki on the Chukwa site.
> >
> > Regarding your specific question, I believe the DemuxManager process is the
> > first step in aggregating the data sink files. It moves the chunks to the
> > dataSinkArchives directory once it's done with them. The ArchiveManager later
> > archives those chunks.
> >
> > 1.  Collectors write chunks to logs/*.chukwa files until a 64MB chunk size
> > is reached or a given time interval is reached.
> >> *  to: logs/*.chukwa
> > 2.  Collectors close chunks and rename them to *.done
> >> *  from: logs/*.chukwa
> >> *  to: logs/*.done
> > 3.  DemuxManager wakes up every 20 seconds, runs M/R to merge *.done files,
> > and moves them.
> >> *  from: logs/*.done
> >> *  to: demuxProcessing/mrInput
> >> *  to: demuxProcessing/mrOutput
> >> *  to: dataSinkArchives/[yyyyMMdd]/*/*.done
> > 4.  PostProcessManager wakes up every few minutes and aggregates, orders,
> > and de-dups record files.
> >> *  from: postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
> >> *  to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt
> > 5.  HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group
> > 5-minute logs into hourly logs.
> >> *  from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt
> >> *  to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]
> >> *  to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
> >> *  leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/
> > 6.  DailyChukwaRecordRolling runs M/R jobs at 1:30 AM to group hourly logs
> > into daily logs.
> >> *  from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt
> >> *  to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]
> >> *  to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt
> >> *  leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/
> > 7.  ChukwaArchiveManager wakes up every half hour or so, then aggregates
> > and removes dataSinkArchives data using M/R.
> >> *  from: dataSinkArchives/[yyyyMMdd]/*/*.done
> >> *  to: archivesProcessing/mrInput
> >> *  to: archivesProcessing/mrOutput
> >> *  to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*
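
(A side note on watching this flow on a live cluster: periodically listing the
stage directories makes the hand-offs easy to see.  A rough sketch, assuming
the default /chukwa root and using the directory names from the steps above:

    for d in logs demuxProcessing/mrInput demuxProcessing/mrOutput \
             dataSinkArchives postProcess repos finalArchives; do
      echo "== /chukwa/$d"
      hadoop fs -ls "/chukwa/$d" 2>/dev/null
    done

Nothing in the sketch moves data; it only lists what each stage currently
holds.)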
> >
> > thanks,
> > Bill
> >
> > On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes <corbin@tynt.com> wrote:
> >> I am trying to understand the flow of data inside HDFS as it's processed
> >> by the data processor script.
> >> I see that archive.sh and demux.sh are run, which run ArchiveManager and
> >> DemuxManager.  Just from reading the code, it appears that they are both
> >> looking at the data sink (default /chukwa/logs).
> >>
> >> Can someone shed some light on how ArchiveManager and DemuxManager interact?
> >> E.g. I was under the impression that the data flowed through the archiving
> >> process first and then got fed into the demuxing after it had created .arc
> >> files.
> >>
> >
> >
>
>
