spark-dev mailing list archives

From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: Skip Corrupted Parquet blocks / footer.
Date Wed, 04 Jan 2017 08:06:43 GMT

Hi,

The method readAllFootersInParallel is implemented in Parquet's
ParquetFileReader, not in Spark, so the Spark config
"spark.sql.files.ignoreCorruptFiles" has no effect on it.

Reading all footers in parallel speeds up the task, but it gives us no
control over whether corrupt files are ignored.

Of course, we could read these footers in sequence and ignore the corrupt
ones, but that might be inefficient. Since this is a relatively rare corner
case, I don't expect we will add this.
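
The sequential idea can be approximated outside Spark as a pre-filter over the input paths. What follows is only a minimal sketch, not Spark's or Parquet's implementation: the names ParquetTailCheck, hasParquetTail and partitionReadable are hypothetical, and it assumes local file paths (for HDFS paths the same check would have to go through Hadoop's FileSystem API rather than RandomAccessFile). It checks only the 4-byte magic number at the tail, the same [80, 65, 82, 49] ("PAR1") named in the error message, so a file whose footer is damaged elsewhere would still pass.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets
import scala.util.Try

// Hypothetical helper, not part of Spark or Parquet.
object ParquetTailCheck {
  // "PAR1" in ASCII, i.e. the bytes [80, 65, 82, 49] from the error message.
  private val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  /** True if the local file at `path` ends with the Parquet magic bytes.
    * Any I/O failure (missing file, permission error) is treated as "not readable". */
  def hasParquetTail(path: String): Boolean =
    Try {
      val file = new RandomAccessFile(path, "r")
      try {
        val len = file.length()
        if (len < Magic.length) false
        else {
          val tail = new Array[Byte](Magic.length)
          file.seek(len - Magic.length)
          file.readFully(tail)
          java.util.Arrays.equals(tail, Magic)
        }
      } finally file.close()
    }.getOrElse(false)

  /** Checks the paths one at a time and splits them into (readable, skipped). */
  def partitionReadable(paths: Seq[String]): (Seq[String], Seq[String]) =
    paths.partition(hasParquetTail)
}
```

Assuming the files are reachable as local paths, the surviving ones could then be passed on with something like sqlContext.read.parquet(readable: _*), at the cost of one extra small read per file.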

Maybe Parquet could implement an option to ignore corrupt files. Even so, an
updated Parquet release with that option can't be expected to be available
to Spark very soon.



khyati wrote
> Hi Reynold Xin,
> 
> In Spark 2.1.0,
> I tried setting spark.sql.files.ignoreCorruptFiles = true with the
> following commands (either setConf or sql):
> 
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> 
> sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")
> // or
> sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true")
> 
> but I still get an error when reading the Parquet files with
> val newDataDF =
> sqlContext.read.parquet("/data/tempparquetdata/corruptblock.0","/data/tempparquetdata/data1.parquet")
> 
> Error: ERROR executor.Executor: Exception in task 0.0 in stage 4.0 (TID 4)
> java.io.IOException: Could not read footer: java.lang.RuntimeException:
> hdfs://192.168.1.53:9000/data/tempparquetdata/corruptblock.0 is not a
> Parquet file. expected magic number at tail [80, 65, 82, 49] but found
> [65, 82, 49, 10]
> 	at
> org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
> 
> 
> Please let me know if I am missing anything.





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 