crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-405) Explore adding support for idempotent MRPipeline.plan()
Date Thu, 17 Jul 2014 19:15:05 GMT


Micah Whitacre commented on CRUNCH-405:

Yeah, it looks like PCollectionImpl and the executor both hold references to the pipeline, so
we could move the logic there.

We might need some sync logic in there to make sure two identical plans weren't executed simultaneously--
there would need to be a way for the execution of one plan to invalidate the execution of
any others that were created.
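
A rough sketch of what I mean by sync logic, just to make the idea concrete. None of this
is existing Crunch API: PlanGuard, PlanToken, and execute() are hypothetical names, and the
token only stands in for whatever a planned executor would be. Planning reads a generation
counter without changing it, and the first plan to execute bumps the counter, which
invalidates every other plan created from the same state.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of Crunch): lets at most one outstanding plan execute. */
final class PlanGuard {
    // Bumped each time a plan is executed; plans created before the bump become stale.
    private final AtomicLong generation = new AtomicLong();

    /** Stand-in for a planned executor; remembers the generation it was planned against. */
    static final class PlanToken {
        private final long plannedAt;

        private PlanToken(long plannedAt) {
            this.plannedAt = plannedAt;
        }
    }

    /** Planning only reads state, so it can be called any number of times. */
    PlanToken plan() {
        return new PlanToken(generation.get());
    }

    /**
     * Runs the job only if no other plan from the same generation has executed yet.
     * The compare-and-set both claims the right to run and invalidates sibling plans.
     * Returns false, without running the job, when the token is stale.
     */
    boolean execute(PlanToken token, Runnable job) {
        if (!generation.compareAndSet(token.plannedAt, token.plannedAt + 1)) {
            return false;
        }
        job.run();
        return true;
    }
}
```

With this, two plans created back to back are both valid until one of them executes. The
other is then rejected, and the caller has to plan again before it can run anything.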

The concern here is the following flow:
//do something
//do something 

Is the worry that the first flow either materializes or writes output that you'd want to make
use of during the second async?  Or is it making sure the second async doesn't fire off a job
that causes the first to fail b/c of an output conflict?

> Explore adding support for idempotent MRPipeline.plan()
> -------------------------------------------------------
>                 Key: CRUNCH-405
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Micah Whitacre
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-405_v1.patch
> Talking through a use case with a consumer, they were interested in having the ability
> to run the MRPipeline.plan() method one to many times prior to ever calling the
> methods.  The reason for this was they were looking at pulling information off the MRExecutor
> to tweak settings inside of their DoFns.
> Currently, however, the MRPipeline implementation does not have an idempotent plan() method,
> as it alters the state of internal values, therefore affecting the full run once done() is
> called.
> It would be nice if we added an idempotent plan() method that could be used to gather this
> information, or perhaps a reset option.

This message was sent by Atlassian JIRA
