From: "Gabriel Reid (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Date: Thu, 13 Jun 2013 11:45:20 +0000 (UTC)
Subject: [jira] [Commented] (CRUNCH-218) Add new Target.WriteMode to skip the write and continue pipeline if an output target exists

    [ https://issues.apache.org/jira/browse/CRUNCH-218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682143#comment-13682143 ]

Gabriel Reid commented on CRUNCH-218:
-------------------------------------

Ok, so the actual use case is the union of what I was talking about and what Josh's patch does :-)

Like I said before, I definitely like the idea of both of these things, but I do think that we need to have a way of saying that we want to overwrite something that has been written in
checkpoint mode previously. I'm thinking that this could either be done automatically, by checking the creation time of the data that is used to create the checkpointed data, or by having some kind of run mode to force overwriting checkpointed data.

Or maybe we shouldn't worry about that, and just advise extra caution for stale data when writing in checkpoint mode.

> Add new Target.WriteMode to skip the write and continue pipeline if an output target exists
> -------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-218
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-218
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: CRUNCH-218b.patch, CRUNCH-218.patch
>
>
> Quite often I write pipelines which persist data to the filesystem midway through the process, and then carry on doing further work.
> If this intermediate data is already present, I think it would be good if I could set a write mode which skips over this first half of processing. This way I'd avoid running jobs unnecessarily and wasting cluster resources regenerating data I already have.
> Example:
> PCollection inter = pipeline.read(source).parallelDo(something).parallelDo(somethingElse);
> inter.write(At.sequenceFile("output"), WriteMode.SKIP_IF_EXISTS);
> PCollection result = inter.parallelDo(moreWork);
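
For illustration only, the "automatic" option from the comment above (comparing the age of the input data against the checkpointed output) might look roughly like the sketch below. It is not taken from either attached patch. The class and method names are hypothetical, it uses plain java.nio on a local filesystem rather than Crunch or Hadoop APIs, and it compares last-modified times as a stand-in for the creation time mentioned in the comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical helper, not part of Crunch: decides whether previously
// checkpointed output is still fresh relative to the input it was built from.
class CheckpointFreshness {

    /**
     * Returns true if the checkpoint at `output` exists and was modified
     * no earlier than `input`, so the write could be skipped.
     * Returns false if the output is missing or older than the input,
     * in which case the checkpoint should be regenerated (overwritten).
     */
    static boolean canReuseCheckpoint(Path input, Path output) throws IOException {
        if (!Files.exists(output)) {
            return false; // nothing checkpointed yet
        }
        FileTime inputTime = Files.getLastModifiedTime(input);
        FileTime outputTime = Files.getLastModifiedTime(output);
        // Stale if the input changed after the checkpoint was written.
        return outputTime.compareTo(inputTime) >= 0;
    }
}
```

A pipeline with several inputs would presumably need to compare the output against the newest of them, and the "force overwrite" run mode would simply bypass this check.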