hive-issues mailing list archives

From "Ferdinand Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory
Date Wed, 18 Jan 2017 04:04:26 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827399#comment-15827399 ]

Ferdinand Xu commented on HIVE-15580:
-------------------------------------

Hi [~xuefuz], the main change in this patch is replacing *groupByKey* with *repartitionAndSortWithinPartitions*.
I would like to confirm that I understand it correctly.
Before this patch:
e.g. GroupByShuffle produces the following result:
K1 -> iterator of {V11,V12,V13...}
K2 -> iterator of {V21,V22,V23...}
...

With this patch:
K1 -> V11
K1 -> V12
K1 -> V13
...
K2 -> V21
...

We then process the records one by one instead of fetching the values from an iterator. If so, does
this change have any side effects?
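
To make the two shapes above concrete, here is a minimal sketch in plain Python. It is not Hive or Spark code: the function names (grouped_shape, flat_sorted_shape, process_flat) and the sample records are hypothetical, and the in-memory sort only stands in for the sort that happens during the shuffle. It illustrates that a consumer of the flat, key-sorted stream has to detect key boundaries itself, and that it only needs to keep state for the current key.

```python
from operator import itemgetter

# Hypothetical sample input: unordered (key, value) records.
records = [("K2", "V21"), ("K1", "V12"), ("K1", "V11"), ("K2", "V22"), ("K1", "V13")]


def grouped_shape(pairs):
    """Grouped shape: each key maps to a collection holding all of its values."""
    out = {}
    for key, value in pairs:
        out.setdefault(key, []).append(value)
    return out


def flat_sorted_shape(pairs):
    """Flat shape: one (key, value) record at a time, ordered by key.

    sorted() materializes the list here; it is only a stand-in for a
    sort performed elsewhere (an assumption of this sketch).
    """
    return iter(sorted(pairs, key=itemgetter(0)))


def process_flat(stream):
    """Count values per key from a key-sorted flat stream.

    Only the current key and its running count are kept, so the state
    held by the consumer does not grow with the number of values per key.
    """
    results = []
    current_key, count = None, 0
    for key, _value in stream:
        if key != current_key:
            # Key boundary: emit the finished group, start a new one.
            if current_key is not None:
                results.append((current_key, count))
            current_key, count = key, 0
        count += 1
    if current_key is not None:
        results.append((current_key, count))
    return results


grouped = grouped_shape(records)
flat_counts = process_flat(flat_sorted_shape(records))
```

In this sketch both shapes yield the same per-key counts; the difference is that the grouped shape holds every value of a key at once, while the flat consumer relies on the records arriving sorted by key.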


> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, HIVE-15580.2.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
