incubator-chukwa-user mailing list archives

From Corbin Hoenes <>
Subject Re: ChukwaRecordOutputFormat only works with ChukwaRecordPartitioner
Date Thu, 22 Jul 2010 19:34:23 GMT
-getmerge seems to work...  any other suggestions on formats?  I like the idea of making the
filename more hadoopy looking.
MyDataType_20100720_0_35_part-00001.R?  Might require more code changes to tack it onto the
extension; I haven't looked at that bit of code yet.
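
For illustration, a minimal sketch of how a name in that style could be built. The class and method names (PartFileNames, hadoopStyleName) are hypothetical and not part of Chukwa; only the zero-padded part-NNNNN suffix is borrowed from the default MapReduce output naming, and the ".R" extension is simply copied from the example above.

```java
import java.util.Locale;

public class PartFileNames {
    // Hypothetical helper, not Chukwa code: appends a Hadoop-style
    // zero-padded partition suffix before the ".R" extension.
    public static String hadoopStyleName(String dataType, String timeBucket, int partition) {
        return String.format(Locale.ROOT, "%s_%s_part-%05d.R", dataType, timeBucket, partition);
    }

    public static void main(String[] args) {
        // Builds the name proposed in the message for partition 1.
        System.out.println(hadoopStyleName("MyDataType", "20100720_0_35", 1));
    }
}
```

Zero-padding keeps the files sorted by partition number in a directory listing, which is the main reason the default MapReduce names use it.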

On Jul 21, 2010, at 10:35 AM, Eric Yang wrote:

> I think this is in the right direction.  Does this filename convention allow dfs -getmerge
> to work on the directory?  If it does, then I am fine with it.  If it doesn’t, it may be
> good to label the output file name as MyDataType_20100720_0_35.R_part0 to align with the default
> output name of mapreduce.
> Regards,
> Eric
> On 7/20/10 11:48 PM, "Corbin Hoenes" <> wrote:
>> I was looking at replacing the ChukwaRecordPartitioner with a HashbasedRecordParitioner.
>> We discussed this earlier here.... there is an issue in JIRA:
>> I patched chukwa to allow for a pluggable partitioner and configured chukwa to use
>> the hash-based partitioner.  But it started failing to rename the _temporary files during
>> the commit phase after the reduce was finished, because now there were multiple reducers trying
>> to move files to /chukwa/demuxProcessing/mrOutput with the same filename.  So I added a bit
>> more to the filename in ChukwaRecordOutputFormat:
>> private String getParition(ChukwaRecordKey key, ChukwaRecord record) {
>>     return "part" + paritioner.getPartition(key, record, conf.getInt("mapred.reduce.tasks",
>> }
>>
>> @Override
>> protected String generateFileNameForKeyValue(ChukwaRecordKey key,
>>         ChukwaRecord record, String name) {
>>     String output = RecordUtil.getClusterName(record) + "/"
>>             + key.getReduceType() + "/" + key.getReduceType() + "_" + getParition(key, record)
>>             + Util.generateTimeOutput(record.getTime());
>>     return output;
>> }
>> So my filenames are now /chukwa/demuxProcessing/mrOutput/MyCluster/MyDataType/MyDataType_part0_20100720_0_35.R.evt
>> I just added the part to the filename, and now when PostProcessorManager picks up that
>> directory it can mv each file into the correct time bucket in /chukwa/repos (it increments
>> a count for each file in that directory).
>> Is there a better solution? I am not sure how general purpose my solution is.
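
The collision described in the quoted message, and why the partition suffix avoids it, can be sketched in isolation. This is a standalone illustration rather than Chukwa code: the class PartitionedNames and its methods are hypothetical, the string keys stand in for ChukwaRecordKey, and the time suffix "_20100720_0_35.R.evt" is assumed from the example filename in the message, not taken from Util.generateTimeOutput. The partition formula is the one used by Hadoop's HashPartitioner.

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionedNames {
    // Same formula as Hadoop's HashPartitioner: non-negative hash modulo reducer count.
    public static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Name without a partition component: identical for every reducer
    // that emits the same data type and time bucket.
    public static String nameWithoutPart(String dataType, String timeSuffix) {
        return dataType + "/" + dataType + timeSuffix;
    }

    // Name with the partition component, mirroring the patched output format.
    public static String nameWithPart(String dataType, int partition, String timeSuffix) {
        return dataType + "/" + dataType + "_part" + partition + timeSuffix;
    }

    public static void main(String[] args) {
        String timeSuffix = "_20100720_0_35.R.evt"; // assumed shape of the time suffix
        int reducers = 4;
        Set<String> plain = new HashSet<>();
        Set<String> withPart = new HashSet<>();
        // Illustrative record keys that share a data type and time bucket.
        for (String key : new String[] {"hostA", "hostB", "hostC", "hostD", "hostE"}) {
            int p = hashPartition(key, reducers);
            plain.add(nameWithoutPart("MyDataType", timeSuffix));
            withPart.add(nameWithPart("MyDataType", p, timeSuffix));
        }
        // Without the suffix every reducer targets one name; with it, one name per partition.
        System.out.println("distinct names without part: " + plain.size());
        System.out.println("distinct names with part:    " + withPart.size());
    }
}
```

Because each reducer handles exactly one partition number, including that number in the name guarantees that no two reducers try to rename their _temporary output to the same path.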
