Date: Sun, 16 Dec 2012 20:36:12 +0000 (UTC)
From: "Gabriel Reid (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: crunch-dev@incubator.apache.org
Subject: [jira] [Commented] (CRUNCH-128) Allow one stage of an MR pipeline to depend on another target being created

    [ https://issues.apache.org/jira/browse/CRUNCH-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13533506#comment-13533506 ]

Gabriel Reid commented on CRUNCH-128:
-------------------------------------

Sorry for the super-slow uptake on this again -- I just went to look at it on Reviewboard and saw that there seems to be an issue with how the patch is published there (https://reviews.apache.org/r/8463/diff/2/). I think this came up on a previous patch as well on RB.
In any case, I took a look at it, and I definitely agree that this is a cleaner way of doing it (i.e. adding the dependency in the call to parallelDo), so that looks good to me. I also tried it out with the checkpointing scenario that we were discussing in the past, and it appears to work perfectly in that scenario.

One question that I would like to bring up is whether or not this should be in the PCollection interface, as opposed to just an additional method on PCollectionImpl. It seems that this will typically only be used internally, and adding it to the already large PCollection interface means that there are six variations of parallelDo to choose from, which can be confusing for (new) users, as well as a bit annoying with code completion in an IDE.

My preference would be to leave this out of the public PCollection interface. What do you think?

> Allow one stage of an MR pipeline to depend on another target being created
> ---------------------------------------------------------------------------
>
>                 Key: CRUNCH-128
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-128
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Josh Wills
>         Attachments: CheckpointingIT.java, CRUNCH-128.patch, CRUNCH-128v2.patch
>
>
> There are a couple of problems (e.g., mapside-joins, total orderings, etc.) where we need to guarantee that one PCollection has been written to the FileSystem before another MapReduce pipeline that depends on that file is allowed to run. This doesn't fit cleanly into the current set of abstractions for Crunch, which is why we force pipelines to execute via the run command to guarantee that the files have been created before the second stage is run.
> We should add the ability for a particular PCollection to require that a SourceTarget instance has been created before it can be executed, and the planner should incorporate this information into the MR pipeline planning process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira