From: Josh Wills <jwills@cloudera.com>
To: user@crunch.apache.org
Date: Tue, 12 Nov 2013 08:26:12 -0800
Subject: Re: Same processing in two m/r jobs

Hey Surbhi,

The planner is trying to minimize the amount of data it writes to disk at the end of the first job; it doesn't usually worry about re-running the same computation in two different jobs if that means less data will be written to disk overall, since most MR jobs aren't CPU-bound.

While that's often a useful heuristic, there are many cases where it isn't true, and this sounds like one of them. My advice would be to materialize the output of the union of S2 and S3, at which point the planner should run the processing of S2 and S3 once at the end of job 1, and then pick up that materialized output for the grouping in job 2.
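For concreteness, here is a minimal sketch of that approach using Crunch's Java API. The input paths, the DoFn, and the key extractor are all hypothetical stand-ins for the real S2/S3 processing:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class MaterializeUnionSketch {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(MaterializeUnionSketch.class, new Configuration());

    // Hypothetical inputs standing in for S2 and S3.
    PCollection<String> s2 = pipeline.read(From.textFile("/data/s2"));
    PCollection<String> s3 = pipeline.read(From.textFile("/data/s3"));

    // The shared processing that the DAG is currently running in both jobs.
    PCollection<String> processed = s2.union(s3).parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String input, Emitter<String> emitter) {
            emitter.emit(input.toLowerCase()); // stand-in for the real work
          }
        }, Writables.strings());

    // Mark the union's output for materialization so the planner writes it
    // out once at the end of job 1 instead of recomputing it in job 2.
    processed.materialize();

    // The downstream grouping now reads back the materialized output.
    PGroupedTable<String, String> grouped = processed
        .by(new MapFn<String, String>() {
          @Override
          public String map(String input) {
            return input.isEmpty() ? "" : input.substring(0, 1); // stand-in key
          }
        }, Writables.strings())
        .groupByKey();

    pipeline.writeTextFile(grouped.ungroup(), "/data/out");
    pipeline.done();
  }
}

Note that materialize() only marks the collection for persistence; the actual write happens when the pipeline runs, and the grouping job then reads the persisted files rather than re-deriving them from S2 and S3.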
Best,
Josh


On Mon, Nov 11, 2013 at 8:30 PM, Mungre, Surbhi <Surbhi.Mungre@cerner.com> wrote:
> Background:
> We have a Crunch pipeline which is used to normalize and standardize some
> entities represented as Avro. In our pipeline, we also capture some context
> information about the errors and warnings which we encounter during our
> processing. We pass a pair of context information and Avro entities through
> our pipeline. At the end of the pipeline, the context information is written
> to HDFS and the Avro entities are written to HFiles.
>
> Problem:
> When we were analyzing the DAG for our Crunch pipeline, we noticed that the
> same processing is done in two m/r jobs: once to capture context information
> and a second time to generate HFiles. I wrote a test which replicates this
> issue with a simple example. The test and the DAG created from it are
> attached to this post. It is clear from the DAG that S2 and S3 are processed
> twice. I am not sure why this processing is done twice, or whether there is
> any way to avoid this behavior.
>
> Surbhi Mungre
> Software Engineer
> www.cerner.com
>
> CONFIDENTIALITY NOTICE This message and any included attachments are
> from Cerner Corporation and are intended only for the addressee. The
> information contained in this message is confidential and may constitute
> inside or non-public information under international, federal, or state
> securities laws. Unauthorized forwarding, printing, copying, distribution,
> or use of such information is strictly prohibited and may be unlawful. If
> you are not the addressee, please promptly delete this message and notify
> the sender of the delivery error by e-mail or you may call Cerner's
> corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.

--
Director of Data Science
Cloudera
Twitter: @josh_wills
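To make the reported DAG shape concrete, here is a minimal sketch, again with hypothetical names and paths, of the pattern Surbhi describes: a single processed union fanning out to two outputs, one standing in for the HDFS context output and one shuffle-backed output standing in for the HFile-generation job.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class FanOutSketch {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(FanOutSketch.class, new Configuration());
    PCollection<String> s2 = pipeline.read(From.textFile("/data/s2"));
    PCollection<String> s3 = pipeline.read(From.textFile("/data/s3"));

    // Shared processing of the union of S2 and S3.
    PCollection<String> processed = s2.union(s3).parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String input, Emitter<String> emitter) {
            emitter.emit(input.trim()); // stand-in for the real work
          }
        }, Writables.strings());

    // Output 1: written directly, standing in for the HDFS context output.
    pipeline.writeTextFile(processed, "/data/context");

    // Output 2: a shuffle-backed output (count() groups by key internally),
    // standing in for the HFile-generation job. With no materialize() call,
    // the planner copies the parallelDo above into both jobs.
    pipeline.writeTextFile(processed.count(), "/data/counts");

    pipeline.done();
  }
}

Adding processed.materialize() before the two writes, as in the earlier sketch, collapses the duplicated work into a single pass at the end of job 1.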