spark-issues mailing list archives

From "Xiao Li (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-25206) Wrong data may be returned for Parquet
Date Sun, 26 Aug 2018 22:46:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593041#comment-16593041 ]

Xiao Li edited comment on SPARK-25206 at 8/26/18 10:45 PM:
-----------------------------------------------------------

Currently, we do not have good test coverage for the case where the physical schema and the
logical schema use different letter cases. Any new change could therefore introduce behavior
changes or bugs, so the first step is to add the tests. [~yucai] Could you help with this effort?

Merging the Parquet filter refactoring would more or less break our backport rule. Maybe we do
not need to claim that we support this scenario before Spark 2.4?


was (Author: smilegator):
Previously, we do not have good test coverage for the case where the physical schema and the
logical schema use different letter cases. Any new change could therefore introduce behavior
changes or bugs, so the first step is to add the tests. [~yucai] Could you help with this effort?

Merging the Parquet filter refactoring would more or less break our backport rule. Maybe we do
not need to claim that we support this scenario before Spark 2.4?

> Wrong data may be returned for Parquet
> --------------------------------------
>
>                 Key: SPARK-25206
>                 URL: https://issues.apache.org/jira/browse/SPARK-25206
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.2, 2.3.1
>            Reporter: yucai
>            Priority: Blocker
>              Labels: correctness
>         Attachments: image-2018-08-24-18-05-23-485.png, image-2018-08-24-22-33-03-231.png,
image-2018-08-24-22-34-11-539.png, image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png,
image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In Spark 2.3.1, the query below silently returns wrong data.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE IF EXISTS t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> +----+
> |  ID|
> +----+
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> +----+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> +--------------------+-----+
> |                 key|value|
> +--------------------+-----+
> |spark.sql.parquet...| true|
> +--------------------+-----+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> +--------------------+-----+
> |                 key|value|
> +--------------------+-----+
> |spark.sql.parquet...|false|
> +--------------------+-----+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff0000}ID{color}"), 0: Integer) down into
Parquet, but {color:#ff0000}ID{color} does not exist in /tmp/data (Parquet is case sensitive,
and the file actually contains {color:#ff0000}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the
pushdown, which is a good fit for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
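
The root cause described above can be illustrated without Spark. The following is a minimal sketch, assuming a hypothetical helper object ColumnResolution (it is not part of Spark or Parquet): it contrasts an exact, case-sensitive lookup of a requested column name against the field names stored in a file with a case-insensitive lookup that refuses to guess when several fields match.

```scala
// Hypothetical helper for illustration only; not Spark's or Parquet's API.
object ColumnResolution {

  /** Case-sensitive lookup: only an exact name match counts. */
  def resolveExact(physical: Seq[String], requested: String): Option[String] =
    physical.find(_ == requested)

  /**
   * Case-insensitive lookup.
   * Returns Right(physicalName) on a unique match, and Left(reason) when the
   * column is missing or when several physical fields differ only by case.
   */
  def resolveIgnoreCase(physical: Seq[String], requested: String): Either[String, String] =
    physical.filter(_.equalsIgnoreCase(requested)) match {
      case Seq(single) => Right(single)
      case Seq()       => Left(s"Column [$requested] was not found in schema")
      case many        => Left(s"Column [$requested] is ambiguous: ${many.mkString(", ")}")
    }
}
```

With the physical field list Seq("id") from the reproduction, resolveExact finds nothing for "ID", which corresponds to the pushed-down filter matching no column, while resolveIgnoreCase maps "ID" to the stored name "id".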



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)