Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Tue, 17 Mar 2015 13:32:38 +0000 (UTC)
From: "Rui Li (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.12775191.1423983665000.117010.1426599158159@Atlassian.JIRA>
In-Reply-To: <JIRA.12775191.1423983665000@Atlassian.JIRA>
References: <JIRA.12775191.1423983665000@Atlassian.JIRA>
 <JIRA.12775191.1423983665988@arcas>
Subject: [jira] [Commented] (HIVE-9697) Hive on Spark is not as aggressive
 as MR on map join [Spark Branch]
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/HIVE-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365102#comment-14365102 ]

Rui Li commented on HIVE-9697:
------------------------------

[~csun] - I think MR doesn't use rawDataSize even when it's available. It seems to use only ContentSummary.

> Hive on Spark is not as aggressive as MR on map join [Spark Branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-9697
>                 URL: https://issues.apache.org/jira/browse/HIVE-9697
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xin Hao
>
> We found the following while running some Big-Bench cases:
> with the same small-table size threshold, the Map Join operator is not generated in the stage plans for Hive on Spark, but it is generated for Hive on MR.
> For example, when we run BigBench Q25, the metadata of one input ORC table is as follows:
>     totalSize=1748955 (about 1.7M)
>     rawDataSize=123050375 (about 120M)
> If we use the following parameter settings,
>     set hive.auto.convert.join=true;
>     set hive.mapjoin.smalltable.filesize=25000000;
>     set hive.auto.convert.join.noconditionaltask=true;
>     set hive.auto.convert.join.noconditionaltask.size=100000000; (100M)
> Map Join is enabled for Hive on MR but not for Hive on Spark.
> We found that Hive on MR compares the HDFS file size of the table (ContentSummary.getLength(), which should approximate the value of 'totalSize') with the 100M threshold, and that size is smaller than 100M, while Hive on Spark compares 'rawDataSize' with the 100M threshold, and that value is larger than 100M. That is why MapJoin is not enabled for Hive on Spark in this case, and as a result Hive on Spark performs much worse than Hive on MR here.
> When we set hive.auto.convert.join.noconditionaltask.size=150000000; (150M), MapJoin is enabled for Hive on Spark, and Hive on Spark then performs similarly to Hive on MR.
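
The size comparison described in the quoted issue can be sketched as follows. This is only an illustration in Python, not Hive's planner code: `fits_map_join` is a hypothetical helper, and the strict `<` comparison is an assumption, since the issue does not say whether the bound is inclusive. The byte values are the ones reported in the description.

```python
# Sizes reported in the issue for one BigBench Q25 input ORC table (bytes).
TOTAL_SIZE = 1_748_955        # totalSize: roughly the on-disk HDFS file size
RAW_DATA_SIZE = 123_050_375   # rawDataSize: size reported in table statistics

# Values of hive.auto.convert.join.noconditionaltask.size tried in the issue.
THRESHOLD_100M = 100_000_000
THRESHOLD_150M = 150_000_000


def fits_map_join(size_bytes: int, threshold_bytes: int) -> bool:
    """Hypothetical check: is the small table under the map-join threshold?

    Assumption: a strict less-than comparison. The real planners may differ.
    """
    return size_bytes < threshold_bytes


# Hive on MR is described as comparing the file size (close to totalSize).
mr_converts = fits_map_join(TOTAL_SIZE, THRESHOLD_100M)

# Hive on Spark is described as comparing rawDataSize.
spark_converts_100m = fits_map_join(RAW_DATA_SIZE, THRESHOLD_100M)
spark_converts_150m = fits_map_join(RAW_DATA_SIZE, THRESHOLD_150M)

print(mr_converts, spark_converts_100m, spark_converts_150m)
```

Under these assumptions the file-size check passes at the 100M threshold, while the rawDataSize check passes only once the threshold is raised to 150M, which matches the behavior reported above.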


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)