incubator-chukwa-user mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: ChukwaArchiveManager and DemuxManager
Date Tue, 02 Feb 2010 19:48:24 GMT
I'm doing the same thing with Pig and log files.

If the date format/location of your log entries doesn't match the Chukwa
date format expected by the TsProcessor, you'll need to write your own
processor, and the TsProcessor is a good example to follow. Either way you'll
need to map your datatype to its processor in chukwa-demux-conf.xml: even if
you use the TsProcessor, that mapping is required, since the default
processor is DefaultProcessor (despite a bug in the wiki documentation that
says it's TsProcessor).
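
For reference, the mapping in chukwa-demux-conf.xml looks something like the
sketch below. This is only an illustration: the datatype name "ApacheLog" is
hypothetical, and you should verify the exact property convention against
your Chukwa version's docs:

<!-- hypothetical entry: maps the assumed datatype "ApacheLog"
     to the TsProcessor -->
<property>
  <name>ApacheLog</name>
  <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.TsProcessor</value>
  <description>Demux parser for the ApacheLog datatype</description>
</property>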

If you write your own processor, be aware of this JIRA:
https://issues.apache.org/jira/browse/CHUKWA-440

The processor should be the only Chukwa customization you need to do. If you
follow the TsProcessor example, all you're doing is determining the
timestamp of the record. All the downstream processes should work fine
without customization.
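
To illustrate just that timestamp-determination step, here's a minimal,
self-contained Java sketch using only the JDK; the Apache access-log format
and the class/method names are my assumptions, not Chukwa API code:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class ApacheLogTimestamp {
    // Apache access-log timestamps look like: [02/Feb/2010:19:48:24 +0000]
    private static final SimpleDateFormat FMT =
        new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);

    /** Returns the epoch-millis timestamp of one access-log line. */
    public static long timestampOf(String line) throws ParseException {
        int open = line.indexOf('[');
        int close = line.indexOf(']', open);
        if (open < 0 || close < 0) {
            throw new ParseException("no [timestamp] field in: " + line, 0);
        }
        return FMT.parse(line.substring(open + 1, close)).getTime();
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(timestampOf(
            "127.0.0.1 - - [02/Feb/2010:19:48:24 +0000] \"GET / HTTP/1.1\" 200 512"));
    }
}

A custom processor would do essentially this parse and then attach the
resulting timestamp to the records it emits.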

If your processor is working properly you should see *.evt files written
beneath the repos/ dir. If it's not, the data will go into an error
directory, most likely because the date parsing failed (the M/R logs will
indicate the cause). These are the files you'll write your Pig scripts
against, using the ChukwaStorage class to read the Chukwa sequence files.
Here's an example of the start of a script which normalizes the timestamp of
each record down to a 5-minute bucket:

define chukwaLoader org.apache.hadoop.chukwa.ChukwaStorage();
define timePeriod   org.apache.hadoop.chukwa.TimePartition('300000');

raw    = LOAD '/chukwa/repos/path/to/evt/files' USING chukwaLoader AS (ts:long, fields);
bodies = FOREACH raw GENERATE (chararray)fields#'body' AS body, timePeriod(ts) AS time;
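
From there, one hypothetical way to continue (the aliases below are
illustrative, not from the original script) is to group on the normalized
time bucket, e.g. to count records per 5-minute window:

grouped = GROUP bodies BY time;
counts  = FOREACH grouped GENERATE group AS time, COUNT(bodies) AS n;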

Also, FYI: if you want to generate a sequence file from an Apache log for
testing without setting up a Chukwa cluster, you can use the utility
discussed here:

https://issues.apache.org/jira/browse/CHUKWA-449

HTH,
Bill


On Tue, Feb 2, 2010 at 11:18 AM, Corbin Hoenes <corbin@tynt.com> wrote:

> This is exactly what I've been trying to create so that I can understand
> how we can use the data once in chukwa.
> Our goal is to use Pig to process our Apache logs.  It looks like I need to
> customize the demux with a custom processor that creates a Chukwa record per
> line in the log file, since right now we get a Chukwa record per chunk, which
> isn't useful to our Pig scripts.
>
> I noticed in another conversation that you've written a custom processor.  What
> kinds of data are you processing?  Did you find you had to split up the
> chunked data into individual ChukwaRecords?  And how does this affect the
> rest of processing (archiving, postprocessing, etc.)?  I am trying to
> understand how much customization I'm going to have to do.
>
>
> On Feb 2, 2010, at 11:56 AM, Bill Graham wrote:
>
> I had a lot of questions regarding the data flow as well. I spent a while
> reverse engineering it and wrote something up on our internal wiki. I
> believe this is what's happening. If others with more knowledge could verify
> what I have below, I'll gladly move it to a wiki on the Chukwa site.
>
> Regarding your specific question, I believe the DemuxManager process is the
> first step in aggregating the data sink files. It moves the chunks to the
> dataSinkArchives directory once it's done with them. The ArchiveManager
> later archives those chunks.
>
>
>    1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk
>    size or a given time interval is reached.
>       - to: logs/*.chukwa
>    2. Collectors close chunks and rename them to *.done
>       - from: logs/*.chukwa
>       - to: logs/*.done
>    3. DemuxManager wakes up every 20 seconds, runs M/R to merge *.done
>    files, and moves them.
>       - from: logs/*.done
>       - to: demuxProcessing/mrInput
>       - to: demuxProcessing/mrOutput
>       - to: dataSinkArchives/[yyyyMMdd]/*/*.done
>    4. PostProcessManager wakes up every few minutes and aggregates, orders
>    and de-dups record files.
>       - from:
>       postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
>       - to:
>       repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt
>    5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to roll
>    5-minute logs up into hourly ones.
>       - from:
>       repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt
>       - to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]
>       - to:
>       repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
>       - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/
>    6. DailyChukwaRecordRolling runs M/R jobs at 1:30 AM to roll hourly
>    logs up into daily ones.
>       - from:
>       repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt
>       - to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]
>       - to:
>       repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt
>       - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/
>    7. ChukwaArchiveManager wakes up every half hour or so and uses M/R to
>    aggregate and remove dataSinkArchives data.
>       - from: dataSinkArchives/[yyyyMMdd]/*/*.done
>       - to: archivesProcessing/mrInput
>       - to: archivesProcessing/mrOutput
>       - to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*
>
>
> thanks,
> Bill
>
> On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes <corbin@tynt.com> wrote:
>
>> I am trying to understand the flow of data inside HDFS as it's processed
>> by the data processor script.
>> I see that archive.sh and demux.sh are run, which run ArchiveManager and
>> DemuxManager.  Just reading the code, it appears that they both are
>> looking at the data sink (default /chukwa/logs).
>>
>> Can someone shed some light on how ArchiveManager and DemuxManager
>> interact?  E.g., I was under the impression that the data flowed through the
>> archiving process first, and then got fed into demuxing after it had created
>> .arc files.
>>
>>
>
>
