hive-issues mailing list archives

From "Andrew Sherman (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (HIVE-17677) Investigate using hive statistics information to optimize HoS parallel order by
Date Tue, 03 Oct 2017 00:23:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-17677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Andrew Sherman reassigned HIVE-17677:
-------------------------------------


> Investigate using hive statistics information to optimize HoS parallel order by
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-17677
>                 URL: https://issues.apache.org/jira/browse/HIVE-17677
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Andrew Sherman
>            Assignee: Andrew Sherman
>
> I think Spark's native parallel order by works in a similar way to what we do for Hive-on-MR.
 That is, it scans the RDD once and samples the data to determine what ranges the data should
be partitioned into, and then scans the RDD again to do the actual order by (with multiple
reducers). 
> One optimization suggested by [~stakiar] is that if we have column stats about the column
we are ordering by, then the first scan on the RDD is not necessary. If we have histogram
data about the RDD, we already know what the ranges of the order by should be. This should
work when running parallel order by on simple tables, but it will be harder when we run it
on derived datasets (although not impossible).
> To do this we would have to understand more about the internals of JavaPairRDD.
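
The idea in the description can be illustrated with a minimal sketch: if a histogram of the order-by column is already available, the partition boundaries can be computed from it directly instead of from a sampling scan. Everything below is an assumption made for illustration. The histogram layout (a sorted list of bucket upper bounds plus a row count per bucket) and the function names range_bounds_from_histogram and partition_for are hypothetical, and none of this reflects how Hive stores column statistics or how Spark picks its range bounds internally.

```python
from bisect import bisect_left
from itertools import accumulate


def range_bounds_from_histogram(bucket_upper_bounds, bucket_counts, num_partitions):
    """Pick up to num_partitions - 1 split points from a histogram.

    bucket_upper_bounds: sorted, inclusive upper bound of each bucket (hypothetical layout).
    bucket_counts: number of rows that fall in each bucket.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if len(bucket_upper_bounds) != len(bucket_counts):
        raise ValueError("bounds and counts must have the same length")

    cumulative = list(accumulate(bucket_counts))
    if not cumulative or cumulative[-1] == 0:
        return []  # no statistics, so no split points can be derived

    total = cumulative[-1]
    bounds = []
    for i in range(1, num_partitions):
        # Ideal number of rows that should sort before the i-th split point.
        target = total * i / num_partitions
        # First bucket whose cumulative row count reaches that target.
        idx = bisect_left(cumulative, target)
        bound = bucket_upper_bounds[idx]
        # A coarse or skewed histogram can yield the same bound twice; keep it once.
        if not bounds or bound > bounds[-1]:
            bounds.append(bound)
    return bounds


def partition_for(key, bounds):
    """Index of the range partition for key; keys equal to a bound go to the lower partition."""
    return bisect_left(bounds, key)


if __name__ == "__main__":
    bounds = range_bounds_from_histogram([10, 20, 30, 40], [100, 100, 100, 100], 4)
    print(bounds)
    print([partition_for(k, bounds) for k in (5, 10, 25, 35)])
```

The split points can be no finer than the histogram buckets, so a skewed column with a coarse histogram may produce fewer partitions than requested. For derived datasets no such histogram exists up front, which matches the caveat in the description.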



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
