incubator-crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-128) Allow one stage of an MR pipeline to depend on another target being created
Date Mon, 17 Dec 2012 16:53:19 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534065#comment-13534065 ]

Josh Wills commented on CRUNCH-128:
-----------------------------------

Thanks Matthias. For the record, I am neutral on having the ParallelDoOperation object vs.
the regular pDo methods with new signatures (with the caveat that we need a better name than
"advancedParallelDo"). The virtue of ParallelDoOperation is the protection it provides against
pDo spiraling out of control. Thinking about it now, I expect we will also want a variation
on this that accepts some number of PObjects as potential dependencies.

I'm +1 for moving CrunchRuntimeException to o.a.c., and I'm also +1 for removing sample()
and sort() from the PCollection interface, although that one should be a different JIRA.
                
> Allow one stage of an MR pipeline to depend on another target being created
> ---------------------------------------------------------------------------
>
>                 Key: CRUNCH-128
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-128
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Josh Wills
>         Attachments: CheckpointingIT.java, CRUNCH-128.patch, CRUNCH-128v2.patch, CRUNCH-128-with-op.patch
>
>
> There are a couple of problems (e.g., mapside-joins, total orderings, etc.) where we
> need to guarantee that one PCollection has been written to the FileSystem before another MapReduce
> pipeline that depends on that file is allowed to run. This doesn't fit cleanly into the current
> set of abstractions for Crunch, which is why we force pipelines to execute via the run command
> to guarantee that the files have been created before the second stage is run.
> We should add the ability for a particular PCollection to require that a SourceTarget
> instance has been created before it can be executed, and the planner should incorporate this
> information into the MR pipeline planning process.
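
The ordering constraint described in the issue can be illustrated with a small, self-contained sketch. This is not the Crunch planner and not the attached patch: `StagePlanner`, `Stage`, `plan`, and the paths used are hypothetical names invented for illustration. It only shows the general idea that a stage declaring a required target is scheduled after the stage that writes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these types are Crunch APIs. */
public class StagePlanner {

    /** A pipeline stage that writes one target and may require other targets first. */
    public static final class Stage {
        final String name;
        final String output;        // target path this stage writes
        final Set<String> requires; // targets that must exist before this stage runs

        public Stage(String name, String output, String... requires) {
            this.name = name;
            this.output = output;
            this.requires = new HashSet<>(Arrays.asList(requires));
        }
    }

    /**
     * Returns stage names in an order where every required target has been
     * written (or already existed) before the stage that needs it runs.
     */
    public static List<String> plan(List<Stage> stages, Set<String> alreadyExisting) {
        Set<String> available = new HashSet<>(alreadyExisting);
        List<Stage> pending = new ArrayList<>(stages);
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            // Pick the first pending stage whose required targets are all available.
            Stage next = null;
            for (Stage s : pending) {
                if (available.containsAll(s.requires)) {
                    next = s;
                    break;
                }
            }
            if (next == null) {
                throw new IllegalStateException("Unsatisfiable or cyclic target dependency");
            }
            pending.remove(next);
            available.add(next.output);
            order.add(next.name);
        }
        return order;
    }

    public static void main(String[] args) {
        // The join is declared first but needs the small side written beforehand.
        List<Stage> stages = Arrays.asList(
            new Stage("join", "/out/joined", "/tmp/small-side"),
            new Stage("writeSmallSide", "/tmp/small-side"));
        System.out.println(plan(stages, new HashSet<String>())); // prints [writeSmallSide, join]
    }
}
```

In this sketch the dependency is attached to the stage itself, so the caller does not need an explicit intermediate run to force the first output to exist. A real planner would also have to decide how such dependencies interact with fusing stages into MapReduce jobs, which this sketch ignores.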

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
