From: Shaun Clowes <sclowes@atlassian.com>
Date: Tue, 18 Jun 2013 11:34:18 +1000
Subject: Re: Extremely slow throughput with dynamic partitions using Hive 0.8.1 in Amazon Elastic Mapreduce
To: user@hive.apache.org

Thanks for following up Ted, I couldn't work out why the progress tracking
was being forced on for dynamic partition inserts, so thanks for your
helpful explanation. I'll raise a JIRA issue regarding the problem. Do you
have any idea for an alternate approach? I could have a go at implementing
a fix, but I'm not sure what a better alternative might be.
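(One data point on where the time may be going: with hive.task.progress on,
postProcessCounter(), quoted further down, makes two
System.currentTimeMillis() calls per row, per operator. The throwaway
sketch below is nothing Hive-specific and the class name is mine; it just
shows what that per-row clock cost adds up to. I'd also expect the calls to
be noticeably more expensive on virtualized hardware like EMR nodes than on
our internal cluster, though I haven't verified that.)

public class TimerOverheadSketch {

  public static void main(String[] args) {
    // Pretend 100 million rows flow through a single operator; the real
    // cost multiplies by the number of operators in the query plan.
    long rows = 100000000L;
    long totalTime = 0;
    long start = System.currentTimeMillis();
    for (long i = 0; i < rows; i++) {
      // Mirrors the pattern in Operator.java: snapshot the clock before
      // process() runs, then accumulate the delta afterwards.
      long beginTime = System.currentTimeMillis();
      totalTime += System.currentTimeMillis() - beginTime;
    }
    long elapsed = System.currentTimeMillis() - start;
    // The accumulated deltas stay near zero (each is sub-millisecond), but
    // the wall clock shows what 2 * rows timer calls actually cost.
    System.out.println("Accumulated deltas: " + totalTime + " ms");
    System.out.println("Wall clock: " + elapsed + " ms for " + rows + " iterations");
  }
}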
In the meantime I've implemented a semantic analyzer hook that removes the
counters that are added to the operators. The source is below in case
anyone else finds it useful. To use it:

- Start Hive with the jar loaded: export HIVE_AUX_JARS_PATH=path/to/jar
- Add the hook: set hive.semantic.analyzer.hook=com.atlassian.hive.RemoveCountersHook;

Thanks,
Shaun

/*
 * Remove counters from tasks (this drastically speeds up dynamic
 * partition inserts in Amazon EMR). The counters are enabled because
 * hive.task.progress is forced on in SemanticAnalyzer.java for
 * dynamic partition insert queries. Ted Xu explains that this is
 * simply so that the job can be killed if the maximum number of
 * dynamic partitions is exceeded in the following mailing list
 * message:
 *
 * http://mail-archives.apache.org/mod_mbox/hive-user/201306.mbox/%3CCAP9%2B16xbCfvCc%3DgiKW4a9GQ588sZhGYiiKH7DS5CH9nr07i-ug%40mail.gmail.com%3E
 *
 * If you are sure the limit will not be exceeded, removing the
 * counters is extremely beneficial.
 *
 * Copyright © 2013 Atlassian Corporation Pty Ltd. Licensed
 * under the Apache License, Version 2.0 (the "License"); you may not use
 * this file except in compliance with the License. You may obtain a copy of
 * the License at http://www.apache.org/licenses/LICENSE-2.0. Unless
 * required by applicable law or agreed to in writing, software distributed
 * under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
 * OR CONDITIONS OF ANY KIND, either express or implied. See the License
 * for the specific language governing permissions and limitations under
 * the License.
 *
 * There are three types of hooks, all of which are executed client-side
 * in Driver.java (i.e. they are processed in the Hive driver/CLI rather
 * than on the cluster):
 *
 *   - Semantic analyzer hooks - Declared with comma separated class
 *     names in hive.semantic.analyzer.hook. Called both before and
 *     after semantic analysis is completed. Extend
 *     AbstractSemanticAnalyzerHook.
 *   - Pre hooks - Declared with comma separated class names in
 *     hive.exec.pre.hooks. Run before the job is submitted, can
 *     implement ExecuteWithHookContext or PreExecute.
 *   - Post hooks - Declared with comma separated class names in
 *     hive.exec.post.hooks. Run after the job has completed, can
 *     implement ExecuteWithHookContext or PostExecute.
 */
package com.atlassian.hive;

import java.io.Serializable;
import java.util.HashMap;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.ConditionalTask;
import org.apache.hadoop.hive.ql.exec.ExecDriver;
import org.apache.hadoop.hive.ql.exec.Operator;
import org.apache.hadoop.hive.ql.exec.Operator.ProgressCounter;
import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.plan.MapredWork;

public class RemoveCountersHook extends AbstractSemanticAnalyzerHook {

  private static final Log LOG = LogFactory.getLog(RemoveCountersHook.class);

  @Override
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    LOG.info("RemoveCountersHook called for preAnalyze");
    return ast;
  }

  @Override
  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
    LOG.info("RemoveCountersHook called for postAnalyze");
    LOG.info("Context " + context);
    LOG.info("Root tasks " + rootTasks);

    for (Task<? extends Serializable> tsk : rootTasks) {
      removeOperatorCountersFromTask(tsk);
    }
  }

  // Walks the task DAG (including conditional task branches) and strips
  // the counters from every operator tree it finds.
  public void removeOperatorCountersFromTask(Task<? extends Serializable> task) {
    Operator.resetLastEnumUsed();

    if (task instanceof ExecDriver) {
      HashMap<String, Operator<? extends Serializable>> opMap =
          ((MapredWork) task.getWork()).getAliasToWork();
      if (!opMap.isEmpty()) {
        for (Operator<? extends Serializable> op : opMap.values()) {
          removeOperatorCountersFromOp(task, op);
        }
      }

      Operator<? extends Serializable> reducer =
          ((MapredWork) task.getWork()).getReducer();
      if (reducer != null) {
        removeOperatorCountersFromOp(task, reducer);
      }
    } else if (task instanceof ConditionalTask) {
      List<Task<? extends Serializable>> listTasks =
          ((ConditionalTask) task).getListTasks();
      for (Task<? extends Serializable> tsk : listTasks) {
        removeOperatorCountersFromTask(tsk);
      }
    }

    if (task.getChildTasks() == null) {
      return;
    }
    for (Task<? extends Serializable> childTask : task.getChildTasks()) {
      removeOperatorCountersFromTask(childTask);
    }
  }

  // Clearing the counter map is what stops postProcessCounter() from
  // timing every row; recurse so child operators are cleared too.
  private void removeOperatorCountersFromOp(Task<? extends Serializable> task,
      Operator<? extends Serializable> op) {
    HashMap<String, ProgressCounter> counterNameToEnum = op.getCounterNameToEnum();
    if (counterNameToEnum == null || counterNameToEnum.size() == 0) {
      LOG.info("No counters to remove from operator " + op);
    } else {
      LOG.info("Removing " + counterNameToEnum.size() + " counters from operator "
          + op + " in task " + task);
      op.setCounterNameToEnum(null);
    }

    if (op.getChildOperators() == null) {
      return;
    }
    for (Operator<? extends Serializable> child : op.getChildOperators()) {
      removeOperatorCountersFromOp(task, child);
    }
  }
}
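(An aside for anyone exploring the other hook types mentioned in the header
comment: an execution hook is similarly small. The sketch below is untested
and the class name and log message are mine, but as far as I can tell it
matches the Hive 0.8-era ExecuteWithHookContext interface. Enable it with
"set hive.exec.pre.hooks=com.atlassian.hive.LoggingPreHook;".)

package com.atlassian.hive;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Hypothetical pre-execution hook: logs the query string just before the
// driver submits the MapReduce job(s).
public class LoggingPreHook implements ExecuteWithHookContext {

  private static final Log LOG = LogFactory.getLog(LoggingPreHook.class);

  @Override
  public void run(HookContext hookContext) throws Exception {
    // HookContext exposes the query plan, conf and read/write entities.
    LOG.info("About to run: " + hookContext.getQueryPlan().getQueryStr());
  }
}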
On 17 June 2013 19:48, Ted Xu <txu@gopivotal.com> wrote:

> Hi Shaun,
>
> Your findings are valid. Hive uses Hadoop job counters to report fatal
> errors, so the client can kill the MapReduce job before it completes.
>
> With regard to your case: because Hive wants to kill the MapReduce job
> when there are too many partitions while using dynamic partitioning,
> counter reporting is forced on. IMHO, fatal error reporting should not
> depend on the "job progress" switch. You can file a JIRA ticket on this
> one.
>
>
> On Fri, Jun 7, 2013 at 1:55 PM, Shaun Clowes <sclowes@atlassian.com> wrote:
>
>> Hi Ted, All,
>>
>> Unfortunately profiling turns out to be extremely slow, so it's not very
>> fruitful for determining what's going on here.
>>
>> On the other hand, I seem to have traced this problem down to the
>> "hive.task.progress" configuration variable. When this is set to true (as
>> it is automatically when a dynamic partition insert is used), the insert
>> is drastically slower than it is otherwise.
>>
>> In SemanticAnalyzer.java it forces this task tracking on as follows:
>>
>>     // turn on hive.task.progress to update # of partitions created to the JT
>>     HiveConf.setBoolVar(conf, HiveConf.ConfVars.HIVEJOBPROGRESS, true);
>>
>> Does anyone know why this must be turned on? What is the need for the
>> number of partitions created to be reported? The end result is a lot more
>> than just the number of partitions having their statistics reported.
>>
>> I'm not sure why the insert is so very slow when it's on; perhaps it's
>> the retrieval of the current time in millis in Operator.java:
>>
>>     /**
>>      * this is called after operator process to buffer some counters.
>>      */
>>     private void postProcessCounter() {
>>       if (counterNameToEnum != null) {
>>         totalTime += (System.currentTimeMillis() - beginTime);
>>       }
>>     }
>>
>> Thanks,
>> Shaun
>>
>>
>> On 6 June 2013 19:00, Ted Xu <txu@gopivotal.com> wrote:
>>
>>> Hi Shaun,
>>>
>>> This is weird. I'm not sure if there is some other reason (e.g., a very
>>> complex UDF?) causing this issue, but it would be best if you could do a
>>> profiling run to see if there is a hot spot.
>>>
>>>
>>> On Thu, Jun 6, 2013 at 4:38 PM, Shaun Clowes <sclowes@atlassian.com> wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> It's actually just one partition being created, which is what makes it
>>>> so weird.
>>>>
>>>> Thanks,
>>>> Shaun
>>>>
>>>>
>>>> On 6 June 2013 18:36, Ted Xu <txu@gopivotal.com> wrote:
>>>>
>>>>> Hi Shaun,
>>>>>
>>>>> Too many partitions in dynamic partitioning may slow down the
>>>>> MapReduce job. Can you estimate how many partitions will be generated
>>>>> after the insert?
>>>>>
>>>>>
>>>>> On Thu, Jun 6, 2013 at 4:24 PM, Shaun Clowes <sclowes@atlassian.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Does anyone know the performance impact dynamic partitions should be
>>>>>> expected to have?
>>>>>>
>>>>>> I have a table that is partitioned by a string in the form 'YYYY-MM'.
>>>>>> When I insert into this table (from an external table that is just an
>>>>>> S3 bucket containing gzipped logs) using dynamic partitioning, I get
>>>>>> very slow performance, with each node in the cluster unable to process
>>>>>> more than 2MB per second. When I run the exact same query with static
>>>>>> partition values I get about 30-40MB/s on each node.
>>>>>>
>>>>>> I've never seen this type of problem with our internal cluster running
>>>>>> Hive 0.7.1 (CDH3u4), but it happens every time in EMR.
>>>>>>
>>>>>> Thanks,
>>>>>> Shaun
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Ted Xu
>>>
>>> --
>>> Regards,
>>> Ted Xu
>
> --
> Regards,
> Ted Xu