hive-issues mailing list archives

From "Xuefu Zhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-9697) Hive on Spark is not as aggressive as MR on map join [Spark Branch]
Date Tue, 17 Mar 2015 04:29:38 GMT


Xuefu Zhang commented on HIVE-9697:

[~lirui], I don't think we reached closure on this. totalSize is closer to the file size, while
rawDataSize is closer to the memory size required. While using totalSize is more aggressive in
choosing map join, some file formats, such as ORC/Parquet, are very good at compression (10x is
common). Thus, if the decision to do a map join is based on file size, the executor can run out
of memory (OOM). On the other hand, rawDataSize gives a more conservative memory estimate, which
also leaves fewer opportunities for map join.

I'm not sure which one is better for Hive on Spark. File size is what
implies and what the user can see, while rawDataSize is closer to the memory required. However,
once OOM happens, the user gets no result. That's worse than a result that arrives more slowly, right?

Any thoughts?
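
To make the trade-off concrete, here is a minimal sketch of the arithmetic. It is not Hive code: the function name is hypothetical, the 10x ratio is the "common" figure mentioned above, and 25000000 is the hive.mapjoin.smalltable.filesize value quoted in the issue below, reused here only as an illustrative file-size threshold.

```python
# Hypothetical helper, not part of Hive: estimate how much memory a table
# might need once decompressed, given its on-disk size and a compression ratio.
def estimated_memory_bytes(file_size_bytes: int, compression_ratio: float) -> float:
    return file_size_bytes * compression_ratio

# Illustrative threshold: the hive.mapjoin.smalltable.filesize value from the issue.
FILE_SIZE_THRESHOLD = 25_000_000

# A table that just passes a file-size check at 10x compression could need
# roughly ten times the threshold in memory.
worst_case = estimated_memory_bytes(FILE_SIZE_THRESHOLD, 10.0)
print(worst_case)

# Ratio actually observed for the ORC table reported in the issue
# (rawDataSize / totalSize), which is far above 10x.
observed_ratio = 123_050_375 / 1_748_955
print(round(observed_ratio, 1))
```

With a ratio this large, a threshold checked against file size says little about the memory the executor will actually need, which is the OOM risk described above.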

> Hive on Spark is not as aggressive as MR on map join [Spark Branch]
> -------------------------------------------------------------------
>                 Key: HIVE-9697
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xin Hao
> We found the following while running some Big-Bench cases:
> when the same small-table size threshold is used, the Map Join operator is not generated
> in the stage plans for Hive on Spark, but it is generated for Hive on MR.
> For example, when we run BigBench Q25, the meta info of one input ORC table is as below:
>     totalSize=1748955 (about 1.7M)
>     rawDataSize=123050375 (about 120M)
> If we use the following parameter settings,
>     set;
>     set hive.mapjoin.smalltable.filesize=25000000;
>     set;
>     set; (100M)
> Map Join is enabled in Hive on MR mode but not in Hive on Spark.
> We found that for Hive on MR, the HDFS file size of the table (ContentSummary.getLength(),
> which should approximate the value of 'totalSize') is compared with the 100M threshold
> (and is smaller than 100M), while for Hive on Spark, 'rawDataSize' is compared with the
> 100M threshold (and is larger than 100M). That is why MapJoin is not enabled for Hive on
> Spark in this case, and as a result Hive on Spark performs much worse than Hive on MR here.
> When we set; (150M), MapJoin
> is enabled for Hive on Spark mode, and Hive on Spark then performs similarly to Hive on MR.
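
The comparison described in the report can be sketched as follows. This is an illustration of the reported behavior, not Hive's actual planner code: the function names are hypothetical, and reading "100M" and "150M" as multiples of 10**6 bytes is an assumption, since the exact settings are elided in the message.

```python
# Sizes reported for the BigBench Q25 input ORC table, in bytes.
TOTAL_SIZE = 1_748_955        # approximates the HDFS file size (compressed)
RAW_DATA_SIZE = 123_050_375   # uncompressed data size estimate

M = 1_000_000  # assumption: "100M" means 100 * 10**6 bytes


def qualifies_for_map_join(size_bytes: int, threshold_bytes: int) -> bool:
    """Hypothetical check: the small table must be under the threshold."""
    return size_bytes < threshold_bytes


def decide(threshold_bytes: int) -> dict:
    """Per the report, MR compares the file size and Spark compares rawDataSize."""
    return {
        "mr": qualifies_for_map_join(TOTAL_SIZE, threshold_bytes),
        "spark": qualifies_for_map_join(RAW_DATA_SIZE, threshold_bytes),
    }


print(decide(100 * M))  # MR converts to map join, Spark does not
print(decide(150 * M))  # both convert once the threshold exceeds rawDataSize
```

The same threshold yields different plans only because the two engines feed different size statistics into the same comparison.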

This message was sent by Atlassian JIRA
