crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-91) Enable custom output file naming
Date Tue, 09 Oct 2012 07:30:02 GMT


Gabriel Reid commented on CRUNCH-91:

Thanks for taking a look at it, Josh. I'm still feeling a bit torn on this one -- on the one
hand, this (the ability to give output files meaningful names) is definitely a use case that
I need in my day-to-day work. On the other hand, I'm a bit concerned about this being a
step towards putting too many bells and whistles into Crunch, as we could alternatively just have
a config option that allows you to keep the default output names provided by Hadoop, and leave
file renaming operations up to the developer.

The really cool feature (well, I think it's cool) that I can see us being able to provide
if we do go for this is an API something like this:

// Some kind of aggregation per product
PTable<Product, PurchaseSummary> productsAndPurchaseSummaries = ...; 

// Writes out the products and purchase summary, with one file per product manufacturer, and
// the file name is the name of the product manufacturer which is extracted from the Product value
pipeline.write(productsAndPurchaseSummaries, At.fanOut(outputDir, new ManufacturerExtractionFn()));

Does that sway you (or anyone else) any more in one direction or the other? Obviously I want
to try to do something that is useful for general use cases, and not just mine (which is currently
mostly based around processing geographical data and outputting it into named files).
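
To make the fan-out idea concrete outside of Crunch, here is a minimal plain-Java sketch. It is not the Crunch API: `FanOutSketch`, `fanOut`, and the "manufacturer:product" string records are made up for illustration, and the grouping happens in memory rather than in a MapReduce job. It only shows the core idea, which is that an extraction function applied to each record decides which output file the record lands in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class FanOutSketch {

    /**
     * Groups records by the file name that nameFn extracts from each record.
     * Hypothetical stand-in for the proposed fan-out; nothing is written to disk.
     */
    static <T> Map<String, List<T>> fanOut(List<T> records, Function<T, String> nameFn) {
        // TreeMap keeps the file names sorted so the result is deterministic.
        Map<String, List<T>> recordsPerFile = new TreeMap<>();
        for (T record : records) {
            recordsPerFile
                .computeIfAbsent(nameFn.apply(record), name -> new ArrayList<>())
                .add(record);
        }
        return recordsPerFile;
    }

    public static void main(String[] args) {
        // Records are "manufacturer:product" strings purely for illustration.
        List<String> products = Arrays.asList("acme:anvil", "globex:widget", "acme:rocket");

        // The lambda plays the role of an extraction function such as ManufacturerExtractionFn.
        Map<String, List<String>> files =
            fanOut(products, p -> p.substring(0, p.indexOf(':')));

        files.forEach((file, recs) -> System.out.println(file + " -> " + recs));
    }
}
```

In a real pipeline the records for each name would be streamed to a file of that name in the output directory instead of being collected in a map.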

> Enable custom output file naming
> --------------------------------
>                 Key: CRUNCH-91
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>         Attachments: CRUNCH-91.patch
>
> The current output file naming behavior in Crunch is to use the classic Hadoop-style
> file naming (i.e. part-m-00001, part-r-00002), with the numerical part of the filename being
> set based on the number of existing files in the output directory to avoid naming collisions.
>
> The intention of this issue is to allow developers to define their own output file names
> for Crunch output files.
>
> The original underlying motivation for this issue is having a custom partitioner in a
> job which routes records to a specific partition (and therefore reducer) based on content
> of the record, and then needing to perform file renaming operations on the output files to
> allow their names to include specific information about the partition they contain. The partition
> number of files currently gets discarded by Crunch, making this renaming impossible. The approach
> proposed here (custom file naming within Crunch) goes one step further, giving developers
> a hook to actually define their own output file naming scheme.
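
The naming behavior described in the issue can be sketched in plain Java as follows. This is only an illustration under assumptions: `OutputNamingSketch`, `currentName`, and `customName` are hypothetical names, not part of Crunch or the attached patch, and the five-digit zero padding is inferred from the example file names in the description.

```java
import java.util.Locale;
import java.util.function.IntFunction;

public class OutputNamingSketch {

    /** Builds a Hadoop-style name such as part-m-00001 or part-r-00002. */
    static String hadoopStyleName(boolean mapOutput, int number) {
        return String.format(Locale.ROOT, "part-%s-%05d", mapOutput ? "m" : "r", number);
    }

    /**
     * Behavior as described in the issue: the number is taken from the count of
     * files already in the output directory, so the partition argument is
     * deliberately ignored and the partition number is lost.
     */
    static String currentName(boolean mapOutput, int partition, int existingFileCount) {
        return hadoopStyleName(mapOutput, existingFileCount);
    }

    /** Hypothetical hook: the developer maps the partition number to a file name. */
    static String customName(int partition, IntFunction<String> namingFn) {
        return namingFn.apply(partition);
    }

    public static void main(String[] args) {
        // Assumed mapping from partition number to content, as a custom partitioner might define it.
        String[] regions = {"europe", "asia", "americas"};

        // Reducer output of partition 1 written into a directory that already holds 7 files.
        System.out.println(currentName(false, 1, 7));

        // With a naming hook, the partition number can drive a meaningful name instead.
        System.out.println(customName(1, p -> regions[p] + ".out"));
    }
}
```

The first call shows why renaming after the fact is not possible today: once the name is derived from the existing file count, nothing in it identifies the partition.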
