spark-issues mailing list archives

From "Patrick Duin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-22247) Hive partition filter very slow
Date Wed, 11 Oct 2017 09:13:01 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199996#comment-16199996 ]

Patrick Duin edited comment on SPARK-22247 at 10/11/17 9:12 AM:
----------------------------------------------------------------

I think this is independent of the storage format. I'm talking about the partition metadata coming
from the Hive metastore. This all happens before the actual data-crunching job is launched.

Just to be clear: in both examples the actual running job correctly processes only the requested
partitions.

This is about optimizing fetching of metadata.
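
(For reference, a minimal spark-shell sketch of the two cases from the description below; the database, table, and column names are placeholders, and the comments only restate what was reported, not independently verified behaviour.)

{code:scala}
// Placeholder table with a partition column named "partition"; substitute a
// real table with many (e.g. ~30k) partitions to reproduce the report.
val table = spark.table("db.tbl")

// Reported as slow: the OR predicate is not pushed to the metastore, so the
// client issues get_partitions('db', 'tbl', -1) and loads every partition's
// metadata into memory before filtering.
table.filter("partition = 'x' or partition = 'y'").show()

// Reported as fast: the single equality predicate is pushed down as
// get_partitions_by_filter('db', 'tbl', 'partition = "x"', -1), so only the
// matching partition metadata is fetched.
table.filter("partition = 'x'").show()
{code}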


was (Author: patduin):
I think this is independent of the storage format. I'm talking about the partition metadata coming
from the Hive metastore. This all happens before the actual data-crunching job is launched.

> Hive partition filter very slow
> -------------------------------
>
>                 Key: SPARK-22247
>                 URL: https://issues.apache.org/jira/browse/SPARK-22247
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.2, 2.1.1
>            Reporter: Patrick Duin
>            Priority: Minor
>
> I found an issue where filtering partitions using a DataFrame results in very bad performance.
> To reproduce:
> Create a Hive table with a lot of partitions and write a Spark query on that table that filters based on the partition column.
> In my use case I've got a table with about 30k partitions.
> I filter the partitions using some Scala via spark-shell:
> {{table.filter("partition=x or partition=y")}}
> This results in a Hive Thrift API call: {{#get_partitions('db', 'table', -1)}}, which is very slow (minutes) and loads all metastore partitions into memory.
> Doing a simpler filter:
> {{table.filter("partition=x")}}
> Results in a Hive Thrift API call: {{#get_partitions_by_filter('db', 'table', 'partition = "x"', -1)}}, which is very fast (seconds) and only fetches partition X into memory.
> If possible, Spark should translate the filter into the more performant Thrift call, or fall back to a more scalable solution that filters out partitions without having to load them all into memory first (for instance, fetching the partitions in batches; a sketch of such a batched fetch follows below).
> I've posted my original question on [SO|https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions]
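
Below is a hedged sketch of the batched-fetch idea from the description above, assuming direct use of Hive's metastore client. The {{listPartitionNames}} and {{getPartitionsByNames}} calls are standard IMetaStoreClient methods, but the batch size, the wanted-names set, and the wiring are illustrative only, not Spark's actual code path.

{code:scala}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
import scala.collection.JavaConverters._

// Illustrative only: a standalone metastore client; Spark itself goes through
// its own Hive client shim rather than constructing one like this.
val client = new HiveMetaStoreClient(new HiveConf())

// Partition names are cheap to list compared with full Partition objects.
val names = client.listPartitionNames("db", "table", (-1).toShort).asScala

// Hypothetical set of partition names we actually need.
val wanted = Set("partition=x", "partition=y")

// Fetch the full partition metadata in fixed-size batches so memory stays
// bounded, instead of one get_partitions call that materialises all ~30k.
val partitions = names
  .filter(wanted.contains)   // keep only the partitions we need
  .grouped(1000)             // hypothetical batch size
  .flatMap(batch => client.getPartitionsByNames("db", "table", batch.asJava).asScala)
  .toSeq
{code}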



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

