hive-issues mailing list archives

From "Rui Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr
Date Fri, 16 Jun 2017 15:00:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052006#comment-16052006 ]

Rui Li commented on HIVE-15104:
-------------------------------

The approach here can cause problems when we cache RDDs, e.g. when combining equivalent works.
Cached RDDs are serialized when they are stored to disk or transferred over the network, and
we still need the hash code after the data is deserialized. I think we have to ser/de the
hash code anyway to be safe.
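
A minimal sketch of the concern, under stated assumptions: SketchKey is a hypothetical
stand-in for a key object that carries a precomputed hash code, and plain java.io
serialization stands in for whatever serializer Spark uses when it stores or ships cached
data. Neither is taken from the patch. The sketch only shows that a hash code left out of
serialization comes back as the default value after a round trip, while one that is
written out survives.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class HashCodeRoundTrip {

    // Hypothetical key type, for illustration only (not Hive's HiveKey).
    static class SketchKey implements Serializable {
        private static final long serialVersionUID = 1L;

        final byte[] bytes;
        // Excluded from serialization: reset to 0 when the object is read back.
        transient int droppedHash;
        // Included in serialization: restored when the object is read back.
        int keptHash;

        SketchKey(byte[] bytes, int hash) {
            this.bytes = bytes;
            this.droppedHash = hash;
            this.keptHash = hash;
        }
    }

    // Serialize to an in-memory buffer and read the object back, roughly what
    // happens when cached data is spilled to disk or sent over the network.
    static SketchKey roundTrip(SketchKey key) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(key);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (SketchKey) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        SketchKey original = new SketchKey(new byte[] {1, 2, 3}, 42);
        SketchKey copy = roundTrip(original);
        System.out.println("hash left out of ser/de:  " + copy.droppedHash);
        System.out.println("hash included in ser/de: " + copy.keptHash);
    }
}
```

Any later step that partitions or compares on the hash code of the deserialized copy would
see the reset value in the first case, which is why carrying the hash code through ser/de
is the safe choice even though it adds a few bytes per key.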

> Hive on Spark generate more shuffle data than hive on mr
> --------------------------------------------------------
>
>                 Key: HIVE-15104
>                 URL: https://issues.apache.org/jira/browse/HIVE-15104
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.2.1
>            Reporter: wangwenli
>            Assignee: Rui Li
>         Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, HIVE-15104.3.patch, TPC-H 100G.xlsx
>
>
> The same SQL, run on the Spark and MR engines, generates different amounts of shuffle data.
> I think this is because Hive on MR serializes only part of the HiveKey, whereas Hive on
> Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
