From: "Rui Li (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Date: Tue, 17 Mar 2015 13:32:38 +0000 (UTC)
Subject: [jira] [Commented] (HIVE-9697) Hive on Spark is not as aggressive as MR on map join [Spark Branch]

    [ https://issues.apache.org/jira/browse/HIVE-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365102#comment-14365102 ]

Rui Li commented on HIVE-9697:
------------------------------

[~csun] - I think MR doesn't use rawDataSize even when it's available. It seems it just uses ContentSummary.
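To make the difference concrete, here is a minimal sketch of the size check using the numbers from the issue description quoted below. It is an illustration only, not Hive source code: the names TableStats and fits_map_join are hypothetical, and the strict "<" comparison is an assumption (the issue does not say whether the real check is inclusive).

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical holder for the two size statistics named in the issue."""
    total_size: int     # on-disk bytes; ContentSummary.getLength() should approximate this
    raw_data_size: int  # uncompressed data size in bytes


def fits_map_join(size_bytes: int, threshold_bytes: int) -> bool:
    """Assumed check: a table qualifies when its size is below the threshold."""
    return size_bytes < threshold_bytes


# Values reported for the BigBench Q25 input ORC table.
stats = TableStats(total_size=1_748_955, raw_data_size=123_050_375)

# hive.auto.convert.join.noconditionaltask.size from the issue.
threshold = 100_000_000

# Hive on MR compares the on-disk size; Hive on Spark compares rawDataSize.
mr_converts = fits_map_join(stats.total_size, threshold)
spark_converts = fits_map_join(stats.raw_data_size, threshold)

# Raising the threshold to 150000000 lets the rawDataSize check pass too.
spark_converts_150m = fits_map_join(stats.raw_data_size, 150_000_000)

print(mr_converts, spark_converts, spark_converts_150m)
```

With the same threshold, the two engines reach different decisions only because they feed different statistics into the same comparison.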
> Hive on Spark is not as aggressive as MR on map join [Spark Branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-9697
>                 URL: https://issues.apache.org/jira/browse/HIVE-9697
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xin Hao
>
> We found the following while running some Big-Bench cases:
> with the same small-table size threshold, the Map Join operator is not generated in the stage plans for Hive on Spark, but it is generated for Hive on MR.
> For example, when we run BigBench Q25, the meta info of one input ORC table is as below:
> totalSize=1748955 (about 1.7M)
> rawDataSize=123050375 (about 120M)
> If we use the following parameter settings,
> set hive.auto.convert.join=true;
> set hive.mapjoin.smalltable.filesize=25000000;
> set hive.auto.convert.join.noconditionaltask=true;
> set hive.auto.convert.join.noconditionaltask.size=100000000; (100M)
> Map Join is enabled in Hive on MR mode but not in Hive on Spark.
> We found that Hive on MR compares the HDFS file size of the table (ContentSummary.getLength(), which should approximate the value of 'totalSize') with the 100M threshold, and that size is smaller than 100M. Hive on Spark compares 'rawDataSize' with the 100M threshold, and that size is larger than 100M. That's why MapJoin is not enabled for Hive on Spark in this case, and as a result Hive on Spark performs much worse than Hive on MR.
> When we set hive.auto.convert.join.noconditionaltask.size=150000000; (150M), MapJoin is enabled in Hive on Spark mode, and Hive on Spark then performs similarly to Hive on MR.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)