From: Shaun Clowes <sclowes@atlassian.com>
Date: Tue, 18 Jun 2013 11:34:18 +1000
Subject: Re: Extremely slow throughput with dynamic partitions using Hive 0.8.1 in Amazon Elastic Mapreduce
To: user@hive.apache.org

Thanks for following up Ted, I couldn't work out why the progress tracking
was being forced on for dynamic partition inserts, so thanks for your
helpful explanation. I'll raise a JIRA issue regarding the problem. Do you
have any idea for an alternate approach? I could have a go at implementing
a fix, but I'm not sure what a better alternative might be.
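(One data point on where the time may be going: with hive.task.progress on,
postProcessCounter(), quoted further down, makes two
System.currentTimeMillis() calls per row, per operator. The throwaway
sketch below is nothing Hive-specific and the class name is mine; it just
shows what that per-row clock cost adds up to. I'd also expect the calls to
be noticeably more expensive on virtualized hardware like EMR nodes than on
our internal cluster, though I haven't verified that.)

public class TimerOverheadSketch {

  public static void main(String[] args) {
    // Pretend 100 million rows flow through a single operator; the real
    // cost multiplies by the number of operators in the query plan.
    long rows = 100000000L;
    long totalTime = 0;
    long start = System.currentTimeMillis();
    for (long i = 0; i < rows; i++) {
      // Mirrors the pattern in Operator.java: snapshot the clock before
      // process() runs, then accumulate the delta afterwards.
      long beginTime = System.currentTimeMillis();
      totalTime += System.currentTimeMillis() - beginTime;
    }
    long elapsed = System.currentTimeMillis() - start;
    // The accumulated deltas stay near zero (each is sub-millisecond), but
    // the wall clock shows what 2 * rows timer calls actually cost.
    System.out.println("Accumulated deltas: " + totalTime + " ms");
    System.out.println("Wall clock: " + elapsed + " ms for " + rows + " iterations");
  }
}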
In the meantime I've implemented a semantic analyzer hook that removes the
counters that are added to the operators. The source is below in case
anyone else finds it useful. To use it:

- Start Hive with the jar loaded: export HIVE_AUX_JARS_PATH=path/to/jar
- Add the hook: set hive.semantic.analyzer.hook=com.atlassian.hive.RemoveCountersHook;

Thanks,
Shaun

/*
 * Remove counters from tasks (this drastically speeds up dynamic
 * partition inserts in Amazon EMR). The counters are enabled because
 * hive.task.progress is forced on in SemanticAnalyzer.java for
 * dynamic partition insert queries. Ted Xu explains that this is
 * simply so that the job can be killed if the maximum number of
 * dynamic partitions is exceeded in the following mailing list
 * message:
 *
 * http://mail-archives.apache.org/mod_mbox/hive-user/201306.mbox/%3CCAP9%2B16xbCfvCc%3DgiKW4a9GQ588sZhGYiiKH7DS5CH9nr07i-ug%40mail.gmail.com%3E
 *
 * If you are sure the limit will not be exceeded, removing the
 * counters is extremely beneficial.
 *
 * Copyright © 2013 Atlassian Corporation Pty Ltd. Licensed
 * under the Apache License, Version 2.0 (the "License"); you may not use
 * this file except in compliance with the License. You may obtain a copy of
 * the License at http://www.apache.org/licenses/LICENSE-2.0. Unless
 * required by applicable law or agreed to in writing, software distributed
 * under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
 * OR CONDITIONS OF ANY KIND, either express or implied. See the License
 * for the specific language governing permissions and limitations under
 * the License.
 *
 * There are three types of hooks, all of which are executed client-side
 * in Driver.java (i.e. they are processed in the Hive driver/CLI rather
 * than on the cluster):
 *
 *   - Semantic analyzer hooks - Declared with comma separated class
 *     names in hive.semantic.analyzer.hook. Called both before and
 *     after semantic analysis is completed. Extend
 *     AbstractSemanticAnalyzerHook.
 *   - Pre hooks - Declared with comma separated class names in
 *     hive.exec.pre.hooks. Run before the job is submitted, can
 *     implement ExecuteWithHookContext or PreExecute.
 *   - Post hooks - Declared with comma separated class names in
 *     hive.exec.post.hooks. Run after the job has completed, can
 *     implement ExecuteWithHookContext or PostExecute.
 */
package com.atlassian.hive;

import java.io.Serializable;
import java.util.HashMap;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.ConditionalTask;
import org.apache.hadoop.hive.ql.exec.ExecDriver;
import org.apache.hadoop.hive.ql.exec.Operator;
import org.apache.hadoop.hive.ql.exec.Operator.ProgressCounter;
import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.plan.MapredWork;

public class RemoveCountersHook extends AbstractSemanticAnalyzerHook {

  private static final Log LOG = LogFactory.getLog(RemoveCountersHook.class);

  @Override
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    LOG.info("RemoveCountersHook called for preAnalyze");
    return ast;
  }

  @Override
  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
    LOG.info("RemoveCountersHook called for postAnalyze");
    LOG.info("Context " + context);
    LOG.info("Root tasks " + rootTasks);

    for (Task<? extends Serializable> tsk : rootTasks) {
      removeOperatorCountersFromTask(tsk);
    }
  }

  // Walks the task DAG (including conditional task branches) and strips
  // the counters from every operator tree it finds.
  public void removeOperatorCountersFromTask(Task<? extends Serializable> task) {
    Operator.resetLastEnumUsed();

    if (task instanceof ExecDriver) {
      HashMap<String, Operator<? extends Serializable>> opMap =
          ((MapredWork) task.getWork()).getAliasToWork();
      if (!opMap.isEmpty()) {
        for (Operator<? extends Serializable> op : opMap.values()) {
          removeOperatorCountersFromOp(task, op);
        }
      }

      Operator<? extends Serializable> reducer =
          ((MapredWork) task.getWork()).getReducer();
      if (reducer != null) {
        removeOperatorCountersFromOp(task, reducer);
      }
    } else if (task instanceof ConditionalTask) {
      List<Task<? extends Serializable>> listTasks =
          ((ConditionalTask) task).getListTasks();
      for (Task<? extends Serializable> tsk : listTasks) {
        removeOperatorCountersFromTask(tsk);
      }
    }

    if (task.getChildTasks() == null) {
      return;
    }
    for (Task<? extends Serializable> childTask : task.getChildTasks()) {
      removeOperatorCountersFromTask(childTask);
    }
  }

  // Clearing the counter map is what stops postProcessCounter() from
  // timing every row; recurse so child operators are cleared too.
  private void removeOperatorCountersFromOp(Task<? extends Serializable> task,
      Operator<? extends Serializable> op) {
    HashMap<String, ProgressCounter> counterNameToEnum = op.getCounterNameToEnum();
    if (counterNameToEnum == null || counterNameToEnum.size() == 0) {
      LOG.info("No counters to remove from operator " + op);
    } else {
      LOG.info("Removing " + counterNameToEnum.size() + " counters from operator "
          + op + " in task " + task);
      op.setCounterNameToEnum(null);
    }

    if (op.getChildOperators() == null) {
      return;
    }
    for (Operator<? extends Serializable> child : op.getChildOperators()) {
      removeOperatorCountersFromOp(task, child);
    }
  }
}
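(An aside for anyone exploring the other hook types mentioned in the header
comment: an execution hook is similarly small. The sketch below is untested
and the class name and log message are mine, but as far as I can tell it
matches the Hive 0.8-era ExecuteWithHookContext interface. Enable it with
"set hive.exec.pre.hooks=com.atlassian.hive.LoggingPreHook;".)

package com.atlassian.hive;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Hypothetical pre-execution hook: logs the query string just before the
// driver submits the MapReduce job(s).
public class LoggingPreHook implements ExecuteWithHookContext {

  private static final Log LOG = LogFactory.getLog(LoggingPreHook.class);

  @Override
  public void run(HookContext hookContext) throws Exception {
    // HookContext exposes the query plan, conf and read/write entities.
    LOG.info("About to run: " + hookContext.getQueryPlan().getQueryStr());
  }
}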
On 17 June 2013 19:48, Ted Xu <txu@gopivotal.com> wrote:

> Hi Shaun,
>
> Your findings are valid. Hive uses Hadoop job counters to report fatal
> errors, so the client can kill the MapReduce job before it completes.
>
> With regard to your case: because Hive wants to kill the MapReduce job
> when there are too many partitions while using dynamic partitioning,
> counter reporting is forced on. IMHO, fatal error reporting should not
> depend on the "job progress" switch. You can file a JIRA ticket on this
> one.
>
>
> On Fri, Jun 7, 2013 at 1:55 PM, Shaun Clowes <sclowes@atlassian.com> wrote:
>
>> Hi Ted, All,
>>
>> Unfortunately profiling turns out to be extremely slow, so it's not very
>> fruitful for determining what's going on here.
>>
>> On the other hand, I seem to have traced this problem down to the
>> "hive.task.progress" configuration variable. When this is set to true (as
>> it is automatically when a dynamic partition insert is used), the insert
>> is drastically slower than it is otherwise.
>>
>> In SemanticAnalyzer.java it forces this task tracking on as follows:
>>
>>     // turn on hive.task.progress to update # of partitions created to the JT
>>     HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVEJOBPROGRESS, true);
>>
>> Does anyone know why this must be turned on? What is the need for the
>> number of partitions created to be reported? The end result is a lot more
>> than just the number of partitions having their statistics reported.
>>
>> I'm not sure why the insert is so very slow when it's on; perhaps it's
>> the retrieval of the current time in millis in Operator.java:
>>
>>     /**
>>      * this is called after operator process to buffer some counters.
>>      */
>>     private void postProcessCounter() {
>>       if (counterNameToEnum != null) {
>>         totalTime += (System.currentTimeMillis() - beginTime);
>>       }
>>     }
>>
>> Thanks,
>> Shaun
>>
>>
>> On 6 June 2013 19:00, Ted Xu <txu@gopivotal.com> wrote:
>>
>>> Hi Shaun,
>>>
>>> This is weird. I'm not sure if there is some other reason (e.g., a very
>>> complex UDF?) causing this issue, but it would be best if you could do a
>>> profiling run to see if there is a hot spot.
>>>
>>>
>>> On Thu, Jun 6, 2013 at 4:38 PM, Shaun Clowes <sclowes@atlassian.com> wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> It's actually just one partition being created, which is what makes it
>>>> so weird.
>>>>
>>>> Thanks,
>>>> Shaun
>>>>
>>>>
>>>> On 6 June 2013 18:36, Ted Xu <txu@gopivotal.com> wrote:
>>>>
>>>>> Hi Shaun,
>>>>>
>>>>> Too many partitions in dynamic partitioning may slow down the
>>>>> MapReduce job. Can you estimate how many partitions will be generated
>>>>> after the insert?
>>>>>
>>>>>
>>>>> On Thu, Jun 6, 2013 at 4:24 PM, Shaun Clowes <sclowes@atlassian.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Does anyone know the performance impact dynamic partitions should be
>>>>>> expected to have?
>>>>>>
>>>>>> I have a table that is partitioned by a string in the form 'YYYY-MM'.
>>>>>> When I insert into this table (from an external table that is just an
>>>>>> S3 bucket containing gzipped logs) using dynamic partitioning, I get
>>>>>> very slow performance, with each node in the cluster unable to process
>>>>>> more than 2MB per second. When I run the exact same query with static
>>>>>> partition values I get about 30-40MB/s on each node.
>>>>>>
>>>>>> I've never seen this type of problem with our internal cluster running
>>>>>> Hive 0.7.1 (CDH3u4), but it happens every time in EMR.
>>>>>>
>>>>>> Thanks,
>>>>>> Shaun
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Ted Xu
>>>
>>> --
>>> Regards,
>>> Ted Xu
>
> --
> Regards,
> Ted Xu