spark-issues mailing list archives

From "Krish (Jira)" <j...@apache.org>
Subject [jira] [Created] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected
Date Wed, 15 Jul 2020 03:44:00 GMT
Krish created SPARK-32317:
-----------------------------

             Summary: Parquet file loading with different schema(Decimal(N, P)) in files is
not working as expected
                 Key: SPARK-32317
                 URL: https://issues.apache.org/jira/browse/SPARK-32317
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.0.0
         Environment: It fails in every environment that I tried.
            Reporter: Krish


Hi,

 

We generate Parquet files partitioned by date on a daily basis, and we sometimes send updates
to historical data. We noticed that, due to a configuration error, the schema of the patch
data is inconsistent with the schema of the earlier files.

Assume the files are generated with a schema that has ID and AMOUNT as fields. The historical
data has the schema ID INT, AMOUNT DECIMAL(15,6), while the files we send as updates have
AMOUNT as DECIMAL(15,2).

 

As a result, a single date partition contains two different schemas. When we load the data for
that date into Spark, the load succeeds but the AMOUNT values are altered.

 

file1.snappy.parquet
ID: INT
AMOUNT: DECIMAL(15,6)
Content:
1,19500.00
2,198.34

file2.snappy.parquet
ID: INT
AMOUNT: DECIMAL(15,2)
Content:
1,19500.00
3,198.34

Load these two files together:

df3 = spark.read.parquet("output/")

df3.show()  # we can see the amount getting altered here

+---+------------+
| ID|      AMOUNT|
+---+------------+
|  1|    1.950000|
|  3|    0.019834|
|  1|19500.000000|
|  2|  198.340000|
+---+------------+
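
The altered values are consistent with a decimal's unscaled integer being read back with the
wrong scale. The sketch below reproduces only that arithmetic in plain Python. It is an
interpretation of the output above, not a trace of what Spark does internally, and the helper
name `reinterpret` is hypothetical.

```python
from decimal import Decimal


def reinterpret(value: str, stored_scale: int, assumed_scale: int) -> Decimal:
    """Take a decimal written with stored_scale and read its unscaled
    integer back as if it had been written with assumed_scale."""
    # A decimal is stored as an integer plus a scale, e.g. 19500.00 at
    # scale 2 is the integer 1950000.
    unscaled = int(Decimal(value).scaleb(stored_scale))
    # Reading that integer with a different scale shifts the decimal point.
    return Decimal(unscaled).scaleb(-assumed_scale)


# Values written as DECIMAL(15,2) but read as DECIMAL(15,6)
print(reinterpret("19500.00", 2, 6))  # 1.950000
print(reinterpret("198.34", 2, 6))    # 0.019834
```

Both results match the two altered rows in the table (ID 1 and ID 3), which are the rows that
come from file2.snappy.parquet.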

 

Options Tried:

We tried to specify a schema with String for all fields, but that didn't work:

df3 = spark.read.format("parquet").schema(schema).load("output/")

Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be
converted in file file*****.snappy.parquet. Column: [AMOUNT], Expected: string, Found: INT64"

 

I know mergeSchema works when one file has a few extra columns, but the fields that are
in common need to have the same schema. That might not work here.

 

I am looking for a workaround here. If there is an option that I haven't tried, please
point me to it.
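
One possible workaround, offered as an untested sketch and not as a confirmed fix: read each
file on its own so that it is decoded with its own schema, cast AMOUNT to a decimal type wide
enough for all files, and union the results. The function names `common_decimal` and
`load_with_common_decimal` are hypothetical, and the sketch assumes the column is a decimal
in every file.

```python
from functools import reduce


def common_decimal(types):
    """Return a (precision, scale) pair wide enough to hold every
    (precision, scale) pair in types without losing digits."""
    scale = max(s for _, s in types)
    integer_digits = max(p - s for p, s in types)
    precision = integer_digits + scale
    if precision > 38:  # Spark's DecimalType allows at most 38 digits
        raise ValueError("no DecimalType is wide enough: %d digits" % precision)
    return precision, scale


def load_with_common_decimal(spark, paths, column="AMOUNT"):
    """Read each Parquet file separately, cast the decimal column to a
    common type, and union the resulting DataFrames by column name."""
    # PySpark is imported here so common_decimal stays usable without it.
    from pyspark.sql import DataFrame, functions as F
    from pyspark.sql.types import DecimalType

    frames = [spark.read.parquet(p) for p in paths]
    types = [
        (f.schema[column].dataType.precision, f.schema[column].dataType.scale)
        for f in frames
    ]
    target = DecimalType(*common_decimal(types))
    casted = [f.withColumn(column, F.col(column).cast(target)) for f in frames]
    return reduce(DataFrame.unionByName, casted)
```

For the two files above this would be called as
load_with_common_decimal(spark, ["output/file1.snappy.parquet", "output/file2.snappy.parquet"]),
and the common type for DECIMAL(15,6) and DECIMAL(15,2) works out to DECIMAL(19,6). Reading
files one by one costs more than a single directory read, so this suits a small number of
affected partitions.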



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

