hive-issues mailing list archives

From "Rui Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory
Date Wed, 18 Jan 2017 14:15:26 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828143#comment-15828143 ]

Rui Li commented on HIVE-15580:
-------------------------------

Hi [~xuefuz], I'd like to check my understanding too. Before the patch, we have three kinds of
shuffle: groupByKey, sortByKey and repartitionAndSortWithinPartitions. For the last two, we
do the grouping ourselves (because the reducer expects <Key, Iterator<Value>>), as in the
sketch below. This grouping uses unbounded memory, which is the root cause of HIVE-15527.
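
Roughly, the hand-rolled grouping looks like the following (a minimal
sketch in Java; the class name and emit() are illustrative placeholders,
not the actual Hive code):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import scala.Tuple2;

    class SortedGrouper<K, V> {
      // Walk pairs already sorted by key and buffer each key's values in
      // an ArrayList before handing <Key, Iterator<Value>> to the reducer.
      // A skewed key with millions of rows grows the buffer without bound.
      void group(Iterator<Tuple2<K, V>> sortedPairs) {
        K currentKey = null;
        List<V> buffer = new ArrayList<>();
        while (sortedPairs.hasNext()) {
          Tuple2<K, V> pair = sortedPairs.next();
          if (currentKey != null && !pair._1().equals(currentKey)) {
            emit(currentKey, buffer.iterator());
            buffer = new ArrayList<>();
          }
          currentKey = pair._1();
          buffer.add(pair._2());
        }
        if (currentKey != null) {
          emit(currentKey, buffer.iterator());
        }
      }

      // Placeholder for whatever consumes the grouped pair downstream.
      void emit(K key, Iterator<V> values) {
      }
    }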

With the patch, we'll replace groupByKey with repartitionAndSortWithinPartitions, and we won't
have to do the grouping ourselves because the GBY operator will do that for us (sketched
below). Is this correct?
BTW, is there any doc indicating Spark's groupByKey uses unbounded memory? I think Spark can
spill the shuffled data to disk if it's too large.
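
In Spark API terms, here is how I read the before/after (a sketch
against JavaPairRDD; the variable names and numPartitions are just
placeholders):

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    class ShuffleSketch {
      static <K extends Comparable<K>, V> void sketch(
          JavaPairRDD<K, V> input, int numPartitions) {
        // Before: groupByKey materializes each key's values as a single
        // Iterable on the reduce side.
        JavaPairRDD<K, Iterable<V>> grouped = input.groupByKey(numPartitions);

        // After: rows arrive sorted by key within each partition, so the
        // GBY operator can detect key boundaries and process rows one at
        // a time instead of holding a whole group in memory.
        JavaPairRDD<K, V> sorted = input.repartitionAndSortWithinPartitions(
            new HashPartitioner(numPartitions));
      }
    }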

> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, HIVE-15580.2.patch, HIVE-15580.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
