hive-issues mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark
Date Wed, 11 May 2016 15:49:12 GMT

    [ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280312#comment-15280312 ]

Xuefu Zhang commented on HIVE-13293:
------------------------------------

[~lirui], thanks for working on this. The patch looks good, but one thing I'm not very sure
of is the persistence level. Order by is almost always at the end of stages. Thus, does it
make sense to have a mix of memory and disk?

As an aside (and out of scope for this patch), do we need to explicitly call rdd.unpersist() for those
cached RDDs once a query is completed? Right now, RDDs are never reused across queries.

> Query occurs performance degradation after enabling parallel order by for Hive on Spark
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-13293
>                 URL: https://issues.apache.org/jira/browse/HIVE-13293
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 2.0.0
>            Reporter: Lifeng Wang
>            Assignee: Rui Li
>         Attachments: HIVE-13293.1.patch, HIVE-13293.1.patch
>
>
> I used TPCx-BB to run some performance tests on the Hive on Spark engine, and found that query 10
> shows a performance degradation when parallel order by is enabled.
> It seems that sampling takes a long time before the real query runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
