crunch-dev mailing list archives

From: Josh Wills
Subject: Re: Execution Control
Date: Thu, 30 Jan 2014 17:10:23 GMT
On Thu, Jan 30, 2014 at 7:09 AM, Jinal Shah wrote:

> Hi everyone,
> This is Jinal Shah, I'm new to the group. I had a question about Execution
> Control in Crunch. Is there any way we can force Crunch to do certain
> operations in parallel or certain operations sequentially? For
> example, let's say we want the pipeline to execute a particular DoFn
> function in the Map phase instead of the Reduce phase or vice versa. Or
> execute a particular flow only after another flow has completed, as
> opposed to running it in parallel.

Forcing a DoFn to operate in a map or reduce phase is tough for the planner
to do right now; we sort of rely on the developer to have a mental model of
how the jobs will proceed. The place where you usually want to force a DoFn
to execute in the reduce vs. the map phase is when you have dependent
groupByKey operations, and you can use cache() or materialize() on the
intermediate output that you want to split on, and the planner will respect

On the latter question, the thing to look for is
org.apache.crunch.ParallelDoOptions, which isn't something I've doc'd in
the user guide yet (it's on the todo list, I promise.) You can give a
parallelDo call an additional argument that specifies one or more
SourceTargets that have to exist before a particular DoFn is allowed to
run. In this way, you can force aspects of the pipeline to be sequential
instead of parallel. We make use of ParallelDoOptions inside of the
MapsideJoinStrategy code, to ensure that the data set that we'll be loading
in-memory actually exists in the file system before we run the code that
reads it into memory.
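
To make the sequencing idea concrete, here is a rough plain-Java model of it. This is a sketch only and not Crunch code: SequencingSketch, Stage, plan, and demo are hypothetical names invented for illustration, and the real planner is far more involved. It shows a stage being held back until every target it declares as required exists, while stages with no such requirement stay free to run together.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Conceptual model only: none of these types are part of Apache Crunch. */
final class SequencingSketch {

    /** Hypothetical stage: may run once all required targets exist, then produces one target. */
    static final class Stage {
        final String name;
        final Set<String> requiredTargets;
        final String producedTarget;

        Stage(String name, Set<String> requiredTargets, String producedTarget) {
            this.name = name;
            this.requiredTargets = requiredTargets;
            this.producedTarget = producedTarget;
        }
    }

    /**
     * Groups stages into "waves". Stages in the same wave could run in parallel;
     * a stage whose required targets are missing is deferred to a later wave.
     */
    static List<List<String>> plan(List<Stage> stages, Set<String> existingTargets) {
        Set<String> available = new HashSet<>(existingTargets);
        List<Stage> pending = new ArrayList<>(stages);
        List<List<String>> waves = new ArrayList<>();

        while (!pending.isEmpty()) {
            // Collect every pending stage whose prerequisites already exist.
            List<Stage> runnable = new ArrayList<>();
            for (Stage stage : pending) {
                if (available.containsAll(stage.requiredTargets)) {
                    runnable.add(stage);
                }
            }
            if (runnable.isEmpty()) {
                throw new IllegalStateException("Required targets can never be produced");
            }

            // Record the wave, then publish its outputs for later waves.
            List<String> wave = new ArrayList<>();
            for (Stage stage : runnable) {
                wave.add(stage.name);
            }
            for (Stage stage : runnable) {
                available.add(stage.producedTarget);
            }
            pending.removeAll(runnable);
            waves.add(wave);
        }
        return waves;
    }

    /** Example: the join must wait for the lookup table, the other two stages need not wait. */
    static List<List<String>> demo() {
        Stage writeLookup = new Stage("writeLookup", Set.of(), "lookup-table");
        Stage cleanEvents = new Stage("cleanEvents", Set.of(), "clean-events");
        Stage mapsideJoin = new Stage("mapsideJoin", Set.of("lookup-table"), "joined");
        return plan(List.of(mapsideJoin, writeLookup, cleanEvents), Set.of());
    }
}
```

Calling SequencingSketch.demo() puts writeLookup and cleanEvents in the first wave and mapsideJoin in the second, even though mapsideJoin was listed first. In Crunch itself you would not write a scheduler like this; you would declare the required SourceTargets through ParallelDoOptions and pass that as the extra argument to parallelDo, as described above.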

> This may have been asked before, so sorry if it is coming up again. If you
> have further questions on the details, do let me know.
> Thanks everyone and Have a great day.
> Thanks
> Jinal

Director of Data Science
Cloudera
Twitter: @josh_wills
