incubator-crunch-user mailing list archives

From Gabriel Reid
Subject Re: Deal with CPU intensive tasks
Date Wed, 06 Feb 2013 08:04:21 GMT
Hi Chao,

There's currently no way of marking a particular part of the pipeline as
being CPU intensive -- however, what you can do is force a slightly
different execution plan by calling "materialize().iterator()" on the
PCollection containing the results of the "FirstPass" parallelDo. This will
force Crunch to run the pipeline up to that point and serialize the
"FirstPass" data, and then use the serialized collection for future
processing instead of rebuilding it.
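
To illustrate why materializing helps, here is a minimal plain-Java sketch.
It does not use the Crunch API: the class MaterializeSketch, the firstPass
stand-in, and the Supplier-based "lazy collection" are all hypothetical names
invented for this illustration. It only models the idea that a lazily defined
collection is rebuilt by every downstream reader, while a materialized one is
computed once and then reused.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical model of "recompute per job" vs. "materialize once".
// This is NOT Crunch code; it only mimics the behaviour described above.
class MaterializeSketch {

    // Counts invocations of the stand-in "FirstPass" function.
    static int firstPassCalls = 0;

    // Stand-in for the CPU-intensive parse step (assumption: any pure function).
    static String firstPass(String source) {
        firstPassCalls++;
        return new StringBuilder(source).reverse().toString();
    }

    // Lazy collection: each call to get() re-runs firstPass over every input.
    static Supplier<List<String>> lazyFirstPass(List<String> sources) {
        return () -> {
            List<String> out = new ArrayList<>();
            for (String s : sources) {
                out.add(firstPass(s));
            }
            return out;
        };
    }

    // Materialize: evaluate once now, then hand out the stored result.
    static Supplier<List<String>> materialize(Supplier<List<String>> lazy) {
        List<String> stored = lazy.get();
        return () -> stored;
    }

    // Simulates three downstream jobs reading the FirstPass output and
    // returns how many times firstPass was invoked in total.
    static int countFirstPassCalls(boolean materializeFirst) {
        firstPassCalls = 0;
        List<String> sources = Arrays.asList("a.java", "b.java", "c.java", "d.java");
        Supplier<List<String>> output = lazyFirstPass(sources);
        if (materializeFirst) {
            output = materialize(output);
        }
        for (int job = 0; job < 3; job++) {
            output.get();
        }
        return firstPassCalls;
    }

    public static void main(String[] args) {
        System.out.println("recomputed per job: " + countFirstPassCalls(false)); // 12
        System.out.println("materialized once:  " + countFirstPassCalls(true));  // 4
    }
}
```

With four inputs and three readers, the lazy version runs the expensive
function 12 times and the materialized version runs it 4 times. In a real
Crunch pipeline the saving comes from the serialized "FirstPass" output being
read back instead of the parallelDo being rerun in each job.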

The plan for the future is to include functionality like this in the API
(which could also possibly run somewhat more efficiently by not immediately
running the pipeline at such a point), but for now the materialize hack is
the easiest way to achieve this.

- Gabriel

On Wed, Feb 6, 2013 at 5:57 AM, Chao Shi wrote:

> Hi crunch users,
> The execution plan of my pipeline is attached to this mail. The
> ParallelDo "FirstPass" (at the top of the graph) is highly CPU intensive,
> because it calls parsers to build ASTs from source code. The best plan I
> can imagine for my case is to have a map-only job in the front and have the
> following 3 MRs read its output.
> I wonder if there's a way to mark my ParallelDo as CPU intensive, so that
> Crunch only creates a single instance of it.
> Thanks,
> Chao
