spark-reviews mailing list archives

From dilipbiswal <...@git.apache.org>
Subject [GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...
Date Sun, 11 Feb 2018 22:51:00 GMT
GitHub user dilipbiswal opened a pull request:

    https://github.com/apache/spark/pull/20579

    [SPARK-23372][SQL] Writing empty struct in parquet fails during execution. It should fail
earlier in the processing.

    ## What changes were proposed in this pull request?
    Running
    spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
    results in:
    ```
    org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty
group: message spark_schema {
     }
    
    at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
     at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
     at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
     at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
     at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
     at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
     at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
     at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
     at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
     at org.apache.spark.scheduler.Task.run(Task.scala:109)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.
    ```
    
    This PR addresses two things:
    1) The above case now fails earlier, during the write preparation phase, rather than at
execution time.
    2) Writing an empty data frame in ORC succeeds, but reading it back fails while inferring
the schema.
        This issue is also addressed in this PR.
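
The kind of up-front check described in item 1 could look roughly like the sketch below. It is
not the code from this PR: the object `EmptySchemaCheck` and its methods are hypothetical names
chosen for illustration, and `IllegalArgumentException` (via `require`) stands in for whatever
exception the patch actually raises. Only the public `org.apache.spark.sql.types` classes are
used, and the sketch assumes that an empty struct should be rejected at any nesting level.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Spark: rejects schemas that contain an
// empty struct before any write task is launched.
object EmptySchemaCheck {

  // True if the type is an empty struct or contains one at any depth.
  def hasEmptyStruct(dt: DataType): Boolean = dt match {
    case s: StructType =>
      s.fields.isEmpty || s.fields.exists(f => hasEmptyStruct(f.dataType))
    case ArrayType(elementType, _) =>
      hasEmptyStruct(elementType)
    case MapType(keyType, valueType, _) =>
      hasEmptyStruct(keyType) || hasEmptyStruct(valueType)
    case _ =>
      false
  }

  // Assumed entry point: call on the driver before starting the write job.
  // Throws IllegalArgumentException as a stand-in for the real error type.
  def verifyWritable(schema: StructType): Unit =
    require(!hasEmptyStruct(schema),
      s"Cannot write a schema containing an empty struct: ${schema.simpleString}")
}
```

Under these assumptions, `EmptySchemaCheck.verifyWritable(spark.emptyDataFrame.schema)` would
fail on the driver, before any task reaches `ParquetFileWriter`, while a schema such as
`new StructType().add("a", IntegerType)` would pass.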
    
    ## How was this patch tested?
    
    Unit tests added in FileBasedDatasourceSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark spark-23372

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20579.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20579
    
----
commit 9f7a1705960250cf6a828787f0f12a9f28b608c5
Author: Dilip Biswal <dbiswal@...>
Date:   2018-02-11T17:09:07Z

    [SPARK-23372] Writing empty struct in parquet fails during execution. It should fail earlier
in the processing

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

