crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-128) Allow one stage of an MR pipeline to depend on another target being created
Date Tue, 11 Dec 2012 22:17:21 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Reid updated CRUNCH-128:
--------------------------------

    Attachment: CheckpointingIT.java

As mentioned on Reviewboard, I encountered an issue with this implementation where I ended up
in an infinite loop in the planner.

I was trying to see if this dependency functionality would easily lend itself to adding pipeline
checkpointing (something we discussed in the past). I'm actually not even sure if this is
the way I would want to do it, but in any case, the attached test case will put the planner
into an infinite loop.

This isn't standard use of the API (yet), so it's probably not that big of a deal; on the
other hand, infinite loops aren't that cool, so if you can see an easy way to avoid getting
into an infinite loop it would be good.

It just occurred to me that this might somehow be a case of a circular dependency, in which
case it would be pretty important for the planner to detect it automatically.
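
One way a circular dependency could be caught up front is a depth-first search over the
dependency graph that reports a cycle as soon as it revisits a node still on the current
path. The sketch below is not the Crunch planner's code: the class name, the use of plain
strings for stages and targets, and the map representation are all assumptions made for
illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Crunch: detects cycles in a "depends on" graph.
final class DependencyCycleCheck {

    // A node with no entry in the state map has not been visited yet.
    private enum State { IN_PROGRESS, DONE }

    /**
     * Returns true if the graph contains a cycle.
     * Each key maps a node to the nodes it depends on.
     */
    static boolean hasCycle(Map<String, List<String>> dependsOn) {
        Map<String, State> state = new HashMap<>();
        for (String node : dependsOn.keySet()) {
            if (visit(node, dependsOn, state)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String node,
                                 Map<String, List<String>> dependsOn,
                                 Map<String, State> state) {
        State current = state.get(node);
        if (current == State.IN_PROGRESS) {
            return true;  // reached a node already on the current path: cycle
        }
        if (current == State.DONE) {
            return false; // fully explored earlier, no cycle through this node
        }
        state.put(node, State.IN_PROGRESS);
        // Nodes that appear only as dependencies have no outgoing edges.
        List<String> deps = dependsOn.getOrDefault(node, Collections.<String>emptyList());
        for (String dep : deps) {
            if (visit(dep, dependsOn, state)) {
                return true;
            }
        }
        state.put(node, State.DONE);
        return false;
    }
}
```

For example, a map in which stageA depends on stageB and stageB depends on stageA makes
hasCycle return true, while a simple chain (stageB depends on stageA, which depends on
nothing) returns false. A planner could run a check like this before planning and fail
with an error instead of looping.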
                
> Allow one stage of an MR pipeline to depend on another target being created
> ---------------------------------------------------------------------------
>
>                 Key: CRUNCH-128
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-128
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>         Attachments: CheckpointingIT.java, CRUNCH-128.patch
>
>
> There are a couple of problems (e.g., mapside-joins, total orderings, etc.) where we
> need to guarantee that one PCollection has been written to the FileSystem before another MapReduce
> pipeline that depends on that file is allowed to run. This doesn't fit cleanly into the current
> set of abstractions for Crunch, which is why we force pipelines to execute via the run command
> to guarantee that the files have been created before the second stage is run.
> We should add the ability for a particular PCollection to require that a SourceTarget
> instance has been created before it can be executed, and the planner should incorporate this
> information into the MR pipeline planning process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
