spark-issues mailing list archives

From "Ameen Tayyebi (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types
Date Thu, 15 Feb 2018 21:05:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ameen Tayyebi resolved SPARK-22913.
-----------------------------------
    Resolution: Won't Fix

Resolving in favor of native Glue integration. These advanced predicates can't be supported
because the version of Hive embedded in Spark does not support them.

> Hive Partition Pruning, Fractional and Timestamp types
> ------------------------------------------------------
>
>                 Key: SPARK-22913
>                 URL: https://issues.apache.org/jira/browse/SPARK-22913
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ameen Tayyebi
>            Priority: Major
>
> Spark currently pushes the predicates in a SQL query down to the Hive Metastore, but
only predicates placed on partitioning columns. As more and more Hive Metastore
implementations emerge, this is an important optimization that allows data to be pre-filtered
to only the relevant partitions. Consider the following example:
> Table:
> create external table data (key string, quantity bigint)
> partitioned by (processing_date timestamp)
> Query:
> select * from data where processing_date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the Hive Metastore for the above query. The reason
is that the code that computes the predicates to send to the Hive Metastore only handles
integral and string column types; it does not know how to handle fractional and timestamp
columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type timestamp
and double. In my specific case, it takes the Spark driver about 6.5 minutes to download
all partitions for the table and then filter them client-side, while the actual processing
time of my query is only 6 seconds. In other words, without partition pruning I'm looking
at 6.5 minutes of processing; with partition pruning, only 6 seconds.
> I have a fix for this developed locally that I'll provide shortly as a pull request.
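The pruning gap the report describes can be sketched in miniature. Spark's actual conversion lives in its Hive shim layer (Scala); the snippet below is a simplified, hypothetical model in Python of how a client might render partition-column predicates as a metastore filter string, and why fractional and timestamp values get silently dropped. The function name and structure are illustrative, not Spark's real API.

```python
# Hypothetical, simplified model of converting partition predicates into a
# Hive Metastore filter string. Predicates whose values cannot be rendered
# return None and are dropped, forcing a full partition listing -- the
# behavior this issue reports for fractional and timestamp columns.
from datetime import datetime


def to_metastore_filter(column, op, value):
    """Render one predicate as a metastore filter string, or None if the
    value's type cannot be pushed down."""
    if isinstance(value, bool):
        return None                        # booleans: not pushable
    if isinstance(value, int):
        return f"{column} {op} {value}"    # integral types: pushed as-is
    if isinstance(value, str):
        return f'{column} {op} "{value}"'  # strings: pushed quoted
    # Fractional and timestamp values fall through here, so the predicate
    # is dropped and every partition must be fetched and filtered client-side.
    return None


# Predicates that convert to None are simply not sent to the metastore.
candidates = [
    to_metastore_filter("ds", "=", "2017-10-23"),                         # pushed
    to_metastore_filter("hour", ">=", 12),                                # pushed
    to_metastore_filter("processing_date", "=", datetime(2017, 10, 23)),  # dropped
]
pushed = [f for f in candidates if f is not None]
```

Under this model, the reporter's fix would amount to adding branches that render timestamp and fractional literals in the syntax the metastore's filter parser accepts, so those predicates reach the server instead of being dropped.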



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

