spark-issues mailing list archives

From "Yuming Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly
Date Wed, 05 Sep 2018 05:33:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603937#comment-16603937 ]

Yuming Wang commented on SPARK-19145:
-------------------------------------

How about we target this ticket and [SPARK-25039|https://issues.apache.org/jira/browse/SPARK-25039]
to Spark 3.0.0?

> Timestamp to String casting is slowing the query significantly
> --------------------------------------------------------------
>
>                 Key: SPARK-19145
>                 URL: https://issues.apache.org/jira/browse/SPARK-19145
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: gagan taneja
>            Priority: Major
>
> I have a time-series table with a timestamp column.
> The following query
> SELECT COUNT(*) AS `count`
>                FROM `default`.`table`
>                WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 50000
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS-0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS-0800')
> LIMIT 50000
> After investigation I found that in the first query the time column is cast to String before the filter is applied.
> In the second query, however, no such casting is performed and the filter compares long values directly.

> Below is the generated physical plan for the slower execution, followed by the physical plan for the faster execution:
> SELECT COUNT(*) AS `count`
>                FROM `default`.`table`
>                WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 50000
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
>                +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS-0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS-0800')
> LIMIT 50000
> == Physical Plan ==
> CollectLimit 50000
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
>                +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>
> In Impala both queries run efficiently with no performance difference.
> Spark should be able to parse the date string and convert it to a Long/Timestamp while generating the optimized logical plan, so that both queries would have similar performance (a sketch of such a rewrite follows below).
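
A minimal sketch of the workaround described above, assuming the reporter's `default`.`table` with a TIMESTAMP column `time`: casting the string literals to TIMESTAMP (instead of letting the analyzer cast the column to STRING) keeps the comparison in the timestamp domain, so it behaves like the to_utc_timestamp variant:

    -- Hypothetical rewrite of the slow query; table and column names are
    -- taken from the report above. The CAST of a constant string is folded
    -- to a timestamp literal during optimization, so the column itself is
    -- never cast and the range filter can be pushed down to the Parquet scan.
    SELECT COUNT(*) AS `count`
    FROM `default`.`table`
    WHERE `time` >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)
      AND `time` <= CAST('2017-01-09 19:53:51' AS TIMESTAMP)
    LIMIT 50000

Running EXPLAIN on this form should show the filter expressed on the timestamp values directly, as in the second plan above, rather than on cast(time as string).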


