crunch-dev mailing list archives

From "Dave Beech (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-218) Add new Target.WriteMode to skip the write and continue pipeline if an output target exists
Date Thu, 13 Jun 2013 12:29:20 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682178#comment-13682178 ]

Dave Beech commented on CRUNCH-218:
-----------------------------------

Josh - I've hit a problem. If the Crunch job fails, the output directories will already have
been created but will be empty. When you then restart the job, the pipeline sees these
directories and skips the processing. The empty directories aren't the real problem in
themselves (a job may legitimately produce no data), but either way I'd want to ensure I only
checkpoint on a successful pipeline run.

This made me realise that Crunch output directories don't contain a "_SUCCESS" flag file
the way traditional MapReduce jobs do. Maybe this should be a separate JIRA. A success flag
like this would solve the problem, because you'd then only restart from a checkpoint path if
it exists and contains a file named "_SUCCESS".
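
As a rough illustration of the check described above, here is a minimal sketch that uses
java.nio.file on a local filesystem as a stand-in. The class and method names
(CheckpointMarker, markSuccess, isCompleteCheckpoint) are hypothetical and are not part of
Crunch; on HDFS the same idea would presumably go through Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a Crunch class: treats an output directory as a
// usable checkpoint only if it contains a "_SUCCESS" marker file.
final class CheckpointMarker {
    static final String SUCCESS_MARKER = "_SUCCESS";

    private CheckpointMarker() {}

    // Intended to be called only after the pipeline run has finished successfully.
    static void markSuccess(Path outputDir) {
        try {
            Files.createDirectories(outputDir);
            // An empty marker file is enough; rewriting an existing one is harmless.
            Files.write(outputDir.resolve(SUCCESS_MARKER), new byte[0]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True only if the directory exists AND holds the marker, so an empty
    // directory left behind by a failed job is not mistaken for a checkpoint.
    static boolean isCompleteCheckpoint(Path outputDir) {
        return Files.isDirectory(outputDir)
                && Files.isRegularFile(outputDir.resolve(SUCCESS_MARKER));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crunch-checkpoint-demo");
        // Directory exists but has no marker yet (simulates a failed job).
        System.out.println("before marker: " + isCompleteCheckpoint(dir));
        markSuccess(dir);
        System.out.println("after marker:  " + isCompleteCheckpoint(dir));
    }
}
```

With a check like this, a skip-if-exists write mode would skip the upstream work only when
isCompleteCheckpoint returns true, and would rerun it for a directory that exists but has no
marker.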
                
> Add new Target.WriteMode to skip the write and continue pipeline if an output target exists
> -------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-218
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-218
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: CRUNCH-218b.patch, CRUNCH-218.patch
>
>
> Quite often I write pipelines which persist data to the filesystem midway through the process, and then carry on doing further work.
> If this intermediate data is already present, I think it would be good if I could set a write mode which skips over this first half of processing. This way I'd avoid running jobs unnecessarily and wasting cluster resources regenerating data I already have.
> Example:
> PCollection<B> inter = pipeline.read(source).parallelDo(something).parallelDo(somethingElse);
> inter.write(At.sequenceFile("output"), WriteMode.SKIP_IF_EXISTS);
> PCollection<C> result = inter.parallelDo(moreWork);
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
