crunch-dev mailing list archives

From "Dominique Dierickx (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-347) Allow writing of single file outputs
Date Tue, 18 Feb 2014 21:07:24 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904592#comment-13904592 ]

Dominique Dierickx commented on CRUNCH-347:
-------------------------------------------

We face a similar problem: we need to merge the final output of the pipeline using
a "cat"-like process. If the output is plain text, "hadoop fs -getmerge" does the trick,
but with Avro or SequenceFile output you basically have to write your own merge logic.
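
To illustrate why a plain "cat" is not enough for container formats, here is a minimal
sketch in Java. It uses a made-up toy format (a 4-byte header followed by length-prefixed
records), not the real Avro or SequenceFile layout; the class name MergeSketch, the header
bytes and all method names are hypothetical. The point it tries to show is only that each
part file carries its own header, so the merge has to happen at the record level.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeSketch {
    // Hypothetical toy header; real container formats store more here
    // (for example a schema and a sync marker).
    static final byte[] MAGIC = {'T', 'O', 'Y', '1'};

    /** Writes one toy container: the header, then length-prefixed records. */
    static byte[] writeContainer(List<String> records) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.write(MAGIC);
            for (String record : records) {
                out.writeUTF(record);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a toy container; expects exactly one header at the start. */
    static List<String> readContainer(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte[] magic = new byte[MAGIC.length];
            in.readFully(magic);
            if (!Arrays.equals(magic, MAGIC)) {
                throw new IOException("missing header");
            }
            List<String> records = new ArrayList<>();
            while (in.available() > 0) {
                records.add(in.readUTF());
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** "cat"-style merge: glues the raw bytes together, headers included. */
    static byte[] concatBytes(List<byte[]> parts) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            buf.write(part, 0, part.length);
        }
        return buf.toByteArray();
    }

    /** Record-level merge: decodes every part, then writes a single container. */
    static byte[] mergeByRecords(List<byte[]> parts) {
        List<String> all = new ArrayList<>();
        for (byte[] part : parts) {
            all.addAll(readContainer(part));
        }
        return writeContainer(all);
    }

    public static void main(String[] args) {
        List<byte[]> parts = Arrays.asList(
                writeContainer(Arrays.asList("alpha", "beta")),
                writeContainer(Arrays.asList("gamma")));

        System.out.println("record-level merge: " + readContainer(mergeByRecords(parts)));
        try {
            readContainer(concatBytes(parts));
        } catch (UncheckedIOException e) {
            // The second part's header sits where a record is expected.
            System.out.println("byte-level concat is unreadable: " + e.getCause());
        }
    }
}
```

With real part files the same idea would presumably go through the format's own reader and
writer classes rather than hand-rolled I/O like this.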

One thing we're investigating is using either Crunch's Shard (see http://crunch.apache.org/user-guide.html#shard)
or running a Sort on the final output in a separate pipeline. However, if sorting is not a
requirement, then this may just be too much overhead, I guess.

> Allow writing of single file outputs
> ------------------------------------
>
>                 Key: CRUNCH-347
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-347
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>    Affects Versions: 0.9.0
>            Reporter: Jason Gauci
>            Priority: Minor
>
> One of the outputs from our system needs to be a single file to support a system that
> is ingesting the data downstream. We currently run the job and then cat the output files
> together to create the final output, but it would be nice if we could pass a flag to the write(...)
> function to handle this case.
> Note that setting the number of reducers globally for the entire job doesn't work in
> this case because of the significant performance implications.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
