hive-issues mailing list archives

From "Xuefu Zhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory
Date Wed, 18 Jan 2017 05:59:26 GMT


Xuefu Zhang commented on HIVE-15580:

[~Ferd], functionally, I don't see anything bad, because groupByKey was used in Hive for aggregation.
With this patch, Hive's groupby operator is able to process one row at a time. Performance-wise,
I'm not sure whether this will improve or degrade things. That depends on the performance difference
between groupByKey() + value iterator and repartitionAndSortWithinPartitions() + dummy value iterator.
It would be great if you guys could find out.

The obvious benefit of this change is that Hive on Spark overcomes the unbounded memory usage
of groupByKey(). The patch also solves the problem in HIVE-15527.
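
To illustrate the memory argument, here is a minimal sketch in plain Python (no Spark involved,
and not the code in the patch). The function names and sample rows are hypothetical. The first
function mimics the groupByKey() pattern, where every value for a key is collected before
aggregation starts. The second mimics the pattern of iterating over rows already sorted by key,
as repartitionAndSortWithinPartitions() would deliver them, keeping only a running aggregate.

```python
from operator import itemgetter


def aggregate_grouped(pairs):
    """groupByKey-like: collect all values per key, then aggregate."""
    groups = {}
    for key, value in pairs:
        # The per-key list grows with the number of rows for that key,
        # so memory use is unbounded for a heavily skewed key.
        groups.setdefault(key, []).append(value)
    return {key: sum(values) for key, values in groups.items()}


def aggregate_sorted_stream(sorted_pairs):
    """Sorted-stream-like: consume one row at a time, keep only running state."""
    started = False
    current_key = None
    running = 0
    for key, value in sorted_pairs:
        if started and key != current_key:
            # Key boundary reached: emit the finished group and reset.
            yield current_key, running
            running = 0
        current_key = key
        running += value
        started = True
    if started:
        # Emit the last group once the input is exhausted.
        yield current_key, running


# Hypothetical sample rows. sorted() stands in for the shuffle-time sort;
# it materializes the list here, unlike a real external sort.
rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]
grouped = aggregate_grouped(rows)
streamed = dict(aggregate_sorted_stream(sorted(rows, key=itemgetter(0))))
```

Both approaches produce the same aggregates. The difference is that the second one holds a
single key and a single running value at any time, regardless of how many rows share a key.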

Please note that this patch is WIP. We will improve it, for example by getting rid of the dummy
value iterator created per row.

I manually ran all Spark tests with this patch, and there was only one test failure, which
needs investigation.

> Replace Spark's groupByKey operator with something with bounded memory
> ----------------------------------------------------------------------
>                 Key: HIVE-15580
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, HIVE-15580.2.patch, HIVE-15580.patch

This message was sent by Atlassian JIRA
