spark-dev mailing list archives

From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: Skip Corrupted Parquet blocks / footer.
Date Wed, 04 Jan 2017 08:12:18 GMT


I forgot to mention another option: we can replace readAllFootersInParallel
with our own parallel reading logic, so that we can ignore corrupt files.
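
A minimal sketch of what such logic could look like, in plain Scala with no
Spark or Parquet dependency. The names FooterScan, checkTailMagic and
readableFiles are hypothetical, and checkTailMagic is only a stand-in for real
footer parsing: it checks the 4-byte tail magic "PAR1" ([80, 65, 82, 49]) that
the error message in the quoted report refers to. The point is the shape: each
file is checked in its own Future, failures are captured per file, and a flag
decides whether a failure is skipped or rethrown.

```scala
import java.io.{File, IOException, RandomAccessFile}
import java.nio.charset.StandardCharsets
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of Spark or Parquet.
object FooterScan {
  // Parquet files end with the magic bytes "PAR1", i.e. [80, 65, 82, 49].
  private val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  // Stand-in for real footer parsing: validates only the tail magic.
  def checkTailMagic(file: File): Try[File] = Try {
    val raf = new RandomAccessFile(file, "r")
    try {
      if (raf.length() < Magic.length)
        throw new IOException(s"$file is too short to be a Parquet file")
      val tail = new Array[Byte](Magic.length)
      raf.seek(raf.length() - Magic.length)
      raf.readFully(tail)
      if (!java.util.Arrays.equals(tail, Magic))
        throw new IOException(
          s"$file is not a Parquet file, found tail ${tail.mkString("[", ", ", "]")}")
      file
    } finally raf.close()
  }

  // Checks all files in parallel; each failure is captured per file, so one
  // corrupt file does not fail the whole batch unless the flag says so.
  def readableFiles(files: Seq[File], ignoreCorruptFiles: Boolean): Seq[File] = {
    val checks = Future.traverse(files)(f => Future(checkTailMagic(f)))
    Await.result(checks, 1.minute).flatMap {
      case Success(f)                       => List(f)
      case Failure(_) if ignoreCorruptFiles => Nil // skip the corrupt file
      case Failure(e)                       => throw e
    }
  }
}
```

With ignoreCorruptFiles = true, a file whose tail is [65, 82, 49, 10] would be
dropped from the result and the remaining files kept. With false, the first
failure is rethrown, which matches the current behavior.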


Liang-Chi Hsieh wrote
> Hi,
> 
> The method readAllFootersInParallel is implemented in Parquet's
> ParquetFileReader, so the Spark config
> "spark.sql.files.ignoreCorruptFiles" has no effect on it.
> 
> Reading all footers in parallel can speed up the task. However, we can't
> control whether corrupt files are ignored.
> 
> Of course, we can read these footers in sequence and ignore the corrupt
> ones, but that might be inefficient. Since this is a relatively rare corner
> case, I don't expect we will support this.
> 
> Maybe Parquet can implement an option to ignore corrupt files. However,
> even so, we can't expect that updated Parquet implementation to be
> available to Spark very soon.
> 
> khyati wrote
>> Hi Reynold Xin,
>> 
>> In Spark 2.1.0, I tried setting spark.sql.files.ignoreCorruptFiles = true
>> with either of these commands:
>> 
>> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> 
>> sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")
>> or
>> sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true")
>> 
>> but I still get an error while reading Parquet files using
>> val newDataDF =
>> sqlContext.read.parquet("/data/tempparquetdata/corruptblock.0","/data/tempparquetdata/data1.parquet")
>> 
>> Error: ERROR executor.Executor: Exception in task 0.0 in stage 4.0 (TID
>> 4)
>> java.io.IOException: Could not read footer: java.lang.RuntimeException:
>> hdfs://192.168.1.53:9000/data/tempparquetdata/corruptblock.0 is not a
>> Parquet file. expected magic number at tail [80, 65, 82, 49] but found
>> [65, 82, 49, 10]
>> 	at
>> org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
>> 
>> 
>> Please let me know if I am missing anything.





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 