crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-405) Explore adding support for idempotent MRPipeline.plan()
Date Thu, 17 Jul 2014 18:25:07 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Whitacre updated CRUNCH-405:
----------------------------------

    Attachment: CRUNCH-405_v1.patch

Here is a first pass at this.  The issue I ran into is that planning mutates the
PCollectionImpl instance: when a collection is materialized, plan() sets its "materializedAt"
value.  Replanning then causes an NPE, because "outputTargetsToMaterialize" is not modified to
remove the value, so the leftover state affects the next attempt to plan.  It is slightly odd
to be modifying the PCollectionImpls during planning, but understandable, since that is how
multiple calls to pipeline.run() on a single pipeline are handled.

Remaining work:
* Add the cleanup to the MRExecutor instead of the pipeline, which would make it more natural.
* Determine the correct way to handle/manipulate the PCollectionImpls (e.g. add a clone vs. the
current clearing hack).
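
To make the "clone vs. clearing" choice concrete, here is a toy model of the problem.  It is
only a sketch: ToyCollection, ToyPipeline, clearPlanState() and planOnCopies() are hypothetical
names, not Crunch classes or methods, and the IllegalStateException merely stands in for the NPE
described above.  The real failure path inside MRPipeline is not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a collection whose state is touched by planning.
final class ToyCollection {
    final String name;
    String materializedAt; // set as a side effect of plan()

    ToyCollection(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for a pipeline; not the real MRPipeline.
final class ToyPipeline {
    // Collections waiting to be materialized, mapped to a made-up target path.
    private final Map<ToyCollection, String> outputTargetsToMaterialize = new HashMap<>();

    void materialize(ToyCollection c, String path) {
        outputTargetsToMaterialize.put(c, path);
    }

    // Non-idempotent: records the planning result on the collection itself,
    // so a second call sees leftover state and fails.
    List<String> plan() {
        List<String> steps = new ArrayList<>();
        for (Map.Entry<ToyCollection, String> e : outputTargetsToMaterialize.entrySet()) {
            ToyCollection c = e.getKey();
            if (c.materializedAt != null) {
                throw new IllegalStateException(c.name + " still carries state from an earlier plan");
            }
            c.materializedAt = e.getValue();
            steps.add("write " + c.name + " to " + e.getValue());
        }
        return steps;
    }

    // Option 1 (clearing): undo plan()'s side effects so it can be called again.
    void clearPlanState() {
        for (ToyCollection c : outputTargetsToMaterialize.keySet()) {
            c.materializedAt = null;
        }
    }

    // Option 2 (clone): plan against copies so the originals are never touched.
    List<String> planOnCopies() {
        ToyPipeline copy = new ToyPipeline();
        for (Map.Entry<ToyCollection, String> e : outputTargetsToMaterialize.entrySet()) {
            copy.materialize(new ToyCollection(e.getKey().name), e.getValue());
        }
        return copy.plan();
    }
}

final class IdempotentPlanSketch {
    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline();
        pipeline.materialize(new ToyCollection("words"), "/tmp/words");

        // Planning on copies can be repeated freely.
        System.out.println(pipeline.planOnCopies());
        System.out.println(pipeline.planOnCopies());

        // Planning in place needs an explicit clear before it can be repeated.
        System.out.println(pipeline.plan());
        pipeline.clearPlanState();
        System.out.println(pipeline.plan());
    }
}
```

In this model the clearing approach keeps plan() as it is but relies on every mutated field being
reset, while the copy approach never touches the original state at the cost of duplicating it.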

> Explore adding support for idempotent MRPipeline.plan()
> -------------------------------------------------------
>
>                 Key: CRUNCH-405
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-405
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Micah Whitacre
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-405_v1.patch
>
>
> Talking through a use case with a consumer, they were interested in having the ability
> to run the MRPipeline.plan() method one or more times prior to ever calling the
> Pipeline.run/done methods.  The reason for this was that they were looking at pulling
> information off the MRExecutor to tweak settings inside of their DoFns.
> Currently, however, the MRPipeline implementation does not have an idempotent plan() method:
> it alters the state of internal values, which affects the full run once done() is called.
> It would be nice if we added an idempotent plan() method that could gather this information,
> or perhaps a reset option.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
