From: Chao Shi <stepinto@live.com>
Date: Wed, 6 Feb 2013 19:16:47 +0800
Subject: Re: Deal with CPU intensive tasks
To: crunch-user@incubator.apache.org

It works. Thanks! I added a groupByKey to force it into an MR stage.

On Wed, Feb 6, 2013 at 4:04 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Hi Chao,
>
> There's currently no way of marking a particular part of the pipeline as
> being CPU intensive -- however, what you can do is force a slightly
> different execution plan by calling "materialize().iterator()" on the
> PCollection containing the results of the "FirstPass" parallelDo. This will
> force Crunch to run the pipeline up to that point and serialize the
> "FirstPass" data, and then use the serialized collection for future
> processing instead of rebuilding it.
>
> The plan for the future is to include functionality like this in the API
> (which could also possibly run somewhat more efficiently by not immediately
> running the pipeline at such a point), but for now the materialize hack is
> the easiest way to achieve this.
>
> - Gabriel
>
>
> On Wed, Feb 6, 2013 at 5:57 AM, Chao Shi <stepinto@live.com> wrote:
>
>> Hi crunch users,
>>
>> The execution plan of my pipeline is attached to this mail. The
>> ParallelDo "FirstPass" (at the top of the graph) is highly CPU intensive,
>> as it needs to call parsers to build ASTs from source code. The best plan I
>> can imagine for my case is to have a map-only job in the front and have the
>> following 3 MRs read its output.
>>
>> I wonder if there's a way to mark my ParallelDo as CPU intensive, so that
>> Crunch creates only a single instance of it.
>>
>> Thanks,
>> Chao
>>
>
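
Why the materialize workaround described in the thread helps can be seen in a toy model. The sketch below is plain Python, not the Crunch API: `LazyCollection`, `parallel_do`, `materialize`, and `count_first_pass_calls` are all hypothetical names invented for illustration. It assumes, as the thread suggests, that each downstream job would otherwise rebuild the "FirstPass" output for itself.

```python
from typing import Callable, Iterable, Iterator


class LazyCollection:
    """Toy stand-in for a lazily planned collection (not the Crunch API)."""

    def __init__(self, source: Callable[[], Iterable]):
        self._source = source  # re-invoked on every traversal

    def parallel_do(self, fn: Callable) -> "LazyCollection":
        # Nothing runs here; the function is only recorded in the plan.
        return LazyCollection(lambda: (fn(x) for x in self._source()))

    def materialize(self) -> "LazyCollection":
        # Run the plan up to this point once and keep the results.
        stored = list(self._source())
        return LazyCollection(lambda: stored)

    def __iter__(self) -> Iterator:
        return iter(self._source())


def count_first_pass_calls(materialize_first: bool) -> int:
    calls = 0

    def first_pass(line: str) -> int:
        nonlocal calls
        calls += 1  # stands in for the expensive parse
        return len(line)

    source_lines = ["a = 1", "b = a + 2"]
    parsed = LazyCollection(lambda: source_lines).parallel_do(first_pass)
    if materialize_first:
        parsed = parsed.materialize()

    # Three downstream consumers, like the three MR jobs in the thread.
    for downstream in (sum, max, min):
        downstream(parsed)
    return calls


if __name__ == "__main__":
    print(count_first_pass_calls(materialize_first=False))  # prints 6
    print(count_first_pass_calls(materialize_first=True))   # prints 2
```

With two input lines and three consumers, the lazy plan runs the expensive function six times, while the materialized plan runs it twice and lets every consumer read the stored result.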