crunch-dev mailing list archives

From "Jason Gauci (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-347) Allow writing of single file outputs
Date Tue, 18 Feb 2014 20:44:21 GMT


Jason Gauci commented on CRUNCH-347:

I guess what the issue is asking for is finer-grained control over crunch.max.reducers.  If I
set this configuration parameter to '1', I force a single reducer and thus get a single
output file.  It would be nice to force one reducer only on the final MapReduce in the job,
the one that needs to output a single file, without affecting the other MapReduces in the pipeline.

Another approach would be a utility function that takes a materialized PCollection, which may
consist of many files on HDFS, and merges them into one file by running an identity mapper
and reducer with the maximum number of reducers for that MapReduce set to 1.
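
To make the merge step concrete, here is a minimal sketch of the file-concatenation idea in plain Java. It is not Crunch or Hadoop API code: it assumes the part files sit in a local directory rather than on HDFS, that they follow the usual "part-" naming, and that concatenating them in file-name order is acceptable. The class name PartFileMerger and its merge method are made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Crunch: merges job output shards into one file.
public class PartFileMerger {

    /**
     * Concatenates every regular file in outputDir whose name starts with
     * "part-" (an assumed naming convention) into mergedFile, in file-name
     * order, and returns the number of files merged.
     */
    public static int merge(Path outputDir, Path mergedFile) throws IOException {
        List<Path> parts;
        // List the directory first so the merged file itself is never picked up.
        try (Stream<Path> listing = Files.list(outputDir)) {
            parts = listing
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(mergedFile)) {
            for (Path part : parts) {
                // Copies the bytes of this part onto the end of the open stream.
                Files.copy(part, out);
            }
        }
        return parts.size();
    }
}
```

This is the same thing as cat-ing the output files after the job finishes; doing it as a single-reducer MapReduce, as described above, would keep the merge inside the pipeline.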

> Allow writing of single file outputs
> ------------------------------------
>                 Key: CRUNCH-347
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>    Affects Versions: 0.9.0
>            Reporter: Jason Gauci
>            Priority: Minor
> One of the outputs from our system needs to be a single file to support a system that
> is ingesting the data downstream.  We currently run the job and then cat the output files
> together to create the final output, but it would be nice if we could pass a flag to the
> write(...) function to handle this case.
> Note that setting the number of reducers globally for the entire job doesn't work in
> this case because of the significant performance implications.
