From: Gabriel Reid
To: crunch-user@incubator.apache.org
Date: Wed, 6 Feb 2013 09:04:21 +0100
Subject: Re: Deal with CPU intensive tasks

Hi Chao,

There's currently no way of marking a particular part of the pipeline as
being CPU intensive -- however, what you can do is force a slightly
different execution plan by calling "materialize().iterator()" on the
PCollection containing the results of the "FirstPass" parallelDo. This will
force Crunch to run the pipeline up to that point and serialize the
"FirstPass" data, and then use the serialized collection for future
processing instead of rebuilding it.

The plan for the future is to include functionality like this in the API
(which could also possibly run somewhat more efficiently by not immediately
running the pipeline at such a point), but for now the materialize hack is
the easiest way to achieve this.

- Gabriel

On Wed, Feb 6, 2013 at 5:57 AM, Chao Shi wrote:

> Hi crunch users,
>
> The execution plan of my pipeline is attached to this mail. The
> ParallelDo "FirstPass" (at the top of the graph) is highly CPU intensive,
> as it needs to call parsers to build ASTs from source code. The best plan I
> can imagine for my case is to have a map-only job in the front and have the
> following 3 MRs read its output.
>
> I wonder if there's a way to mark my ParallelDo as CPU intensive, so that
> Crunch only creates a single instance of it.
>
> Thanks,
> Chao
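
To illustrate why materializing helps here, below is a minimal plain-Java sketch. It does not use the Crunch API: `firstPass`, `lazyFirstPass`, `materialize` and `parseCallsFor` are hypothetical stand-ins invented for this example. It assumes that, without the workaround, each downstream job re-runs the expensive step on its own, and it only counts how often that step executes in each case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class MaterializeSketch {
    // Counts how many times the expensive step is executed.
    static final AtomicInteger parseCalls = new AtomicInteger();

    // Hypothetical stand-in for the CPU-intensive "FirstPass" function.
    public static String firstPass(String source) {
        parseCalls.incrementAndGet();
        return "AST(" + source + ")";
    }

    // A lazy "collection": every read re-runs firstPass on all inputs.
    public static Supplier<List<String>> lazyFirstPass(List<String> sources) {
        return () -> {
            List<String> out = new ArrayList<>();
            for (String s : sources) {
                out.add(firstPass(s));
            }
            return out;
        };
    }

    // Runs the lazy computation once and serves the stored result afterwards.
    public static Supplier<List<String>> materialize(Supplier<List<String>> lazy) {
        List<String> stored = lazy.get();
        return () -> stored;
    }

    // Returns how often firstPass ran when `consumers` downstream jobs read the data.
    public static int parseCallsFor(List<String> sources, int consumers, boolean materializeFirst) {
        parseCalls.set(0);
        Supplier<List<String>> input = lazyFirstPass(sources);
        if (materializeFirst) {
            input = materialize(input);
        }
        for (int i = 0; i < consumers; i++) {
            input.get().size(); // each downstream job reads the full input
        }
        return parseCalls.get();
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList("a.java", "b.java");
        System.out.println(parseCallsFor(sources, 3, false)); // prints 6
        System.out.println(parseCallsFor(sources, 3, true));  // prints 2
    }
}
```

With two inputs and three downstream readers, the expensive step runs six times when read lazily and twice when materialized first. In Crunch itself, the corresponding step is the `materialize()` call on the PCollection that Gabriel describes above.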
