crunch-user mailing list archives

From Chao Shi <stepi...@live.com>
Subject Re: Deal with CPU intensive tasks
Date Wed, 06 Feb 2013 11:16:47 GMT
It works. Thanks! I added a groupByKey to force it into a MR stage.

On Wed, Feb 6, 2013 at 4:04 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Hi Chao,
>
> There's currently no way of marking a particular part of the pipeline as
> being CPU intensive -- however, what you can do is force a slightly
> different execution plan by calling "materialize().iterator()" on the
> PCollection containing the results of the "FirstPass" parallelDo. This will
> force Crunch to run the pipeline up to that point and serialize the
> "FirstPass" data, and then use the serialized collection for future
> processing instead of rebuilding it.
>
> The plan for the future is to include functionality like this in the API
> (which could also possibly run somewhat more efficiently by not immediately
> running the pipeline at such a point), but for now the materialize hack is
> the easiest way to achieve this.
>
> - Gabriel
>
>
> On Wed, Feb 6, 2013 at 5:57 AM, Chao Shi <stepinto@live.com> wrote:
>
>> Hi crunch users,
>>
>> The execution plan of my pipeline is attached to this mail. The
>> ParallelDo "FirstPass" (at the top of the graph) is highly CPU intensive,
>> because it needs to call parsers to build ASTs from source code. The best plan I
>> can imagine for my case is to have a map-only job in the front and have the
>> following 3 MRs read its output.
>>
>> I wonder if there's a way to mark my ParallelDo as CPU intensive, so that
>> Crunch only creates a single instance of it.
>>
>> Thanks,
>> Chao
>>
>
>

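The effect discussed in this thread (running "FirstPass" once and letting the later stages read its stored output, rather than rebuilding it per stage) can be illustrated with a minimal plain-Java sketch. This is only an analogy and does not use the Crunch API: the class MaterializeSketch, the firstPass stand-in, and the consumer count are all hypothetical names chosen for illustration. It simply counts how often the expensive step runs under each plan.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class MaterializeSketch {
    // Counts how many times the expensive step is invoked.
    static final AtomicInteger firstPassCalls = new AtomicInteger();

    // Hypothetical stand-in for the CPU-intensive "FirstPass" parser step.
    static String firstPass(String source) {
        firstPassCalls.incrementAndGet();
        return "AST(" + source + ")";
    }

    // Lazy plan: every downstream consumer re-runs firstPass over the raw input.
    static int runLazy(List<String> inputs, int consumers) {
        firstPassCalls.set(0);
        for (int i = 0; i < consumers; i++) {
            inputs.stream().map(MaterializeSketch::firstPass).forEach(ast -> { });
        }
        return firstPassCalls.get();
    }

    // Materialized plan: run firstPass once, store the results,
    // and let every consumer read the stored collection.
    static int runMaterialized(List<String> inputs, int consumers) {
        firstPassCalls.set(0);
        List<String> stored = inputs.stream()
                .map(MaterializeSketch::firstPass)
                .collect(Collectors.toList());
        for (int i = 0; i < consumers; i++) {
            stored.forEach(ast -> { });
        }
        return firstPassCalls.get();
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("A.java", "B.java");
        System.out.println(runLazy(inputs, 3));         // prints 6
        System.out.println(runMaterialized(inputs, 3)); // prints 2
    }
}
```

With two inputs and three consumers, the lazy plan invokes the expensive step six times, while the materialized plan invokes it twice. In the sketch the saving comes from storing the intermediate results, which mirrors what the thread describes for the serialized "FirstPass" output.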