hive-issues mailing list archives

From "Xuefu Zhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-10458) Enable parallel order by for spark [Spark Branch]
Date Mon, 18 May 2015 18:53:01 GMT


Xuefu Zhang commented on HIVE-10458:

1. I think we should let hive.optimize.sampling.orderby control parallel order by for Spark.
2. As to implementation, we have two choices:
   a1) Use Spark's sortByKey transformation, as your patch #3 does. In this approach,
Spark does the sampling and key partitioning.
   a2) Use Hive's approach: Hive does the sampling and sets up a partitioner, and we use
Spark's repartitionAndSortWithinPartitions transformation with that partitioner. (Currently
this transformation is used in a different context, with a hash partitioner.)
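The sample-then-range-partition idea behind both approaches can be sketched in plain Python (no Spark). The function below is illustrative only, not Hive's or Spark's actual API: it samples the keys, derives split points, routes each record to a partition by key range, and sorts within each partition, so concatenating the partitions yields a total order.

```python
import bisect
import random

def range_partition_sort(records, num_partitions, sample_size=100, seed=42):
    """Totally order (key, value) records across num_partitions partitions.

    Illustrative sketch of sampling + range partitioning + per-partition sort;
    names and signature are assumptions, not a real Hive/Spark API.
    """
    keys = [k for k, _ in records]
    random.seed(seed)
    # Sample the keys and sort the sample to estimate the key distribution.
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    # Pick up to num_partitions - 1 evenly spaced split points from the sample.
    step = max(1, len(sample) // num_partitions)
    splits = sample[step::step][: num_partitions - 1]

    partitions = [[] for _ in range(num_partitions)]
    for k, v in records:
        # bisect_right returns the index of the first split point > k,
        # i.e. the range partition this key falls into; equal keys always
        # land in the same partition.
        partitions[bisect.bisect_right(splits, k)].append((k, v))

    # Sort within each partition; partition i's keys all precede
    # partition i+1's keys, so the concatenation is globally sorted.
    return [sorted(p) for p in partitions]
```

In approach a2, Hive would play the role of the sampling and split-point logic and hand Spark the equivalent of the `bisect`-based partitioner; in approach a1, Spark's own transformation does all of this internally.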

Both approaches are acceptable to me. Approach a1 seems simpler, with less code to write, but
is more tied to Spark. I'm not sure about the performance difference; it would be great to
measure it, but that's not critical at the moment.

If we take approach a1, we need to make sure that we are not doing double sampling. That
is, we need to make sure that MR's sampler and total order partitioner are turned off for
Spark.
> Enable parallel order by for spark [Spark Branch]
> -------------------------------------------------
>                 Key: HIVE-10458
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-10458.1-spark.patch, HIVE-10458.2-spark.patch, HIVE-10458.3-spark.patch
> We don't have to force reducer# to 1 as spark supports parallel sorting.

This message was sent by Atlassian JIRA
