hadoop-hdfs-dev mailing list archives

From Zhang Xiaoyu <zhangxiaoyu...@gmail.com>
Subject Re: how to skip single corrupted SequenceFile in SequenceFileInputFormat ?
Date Thu, 11 Jun 2015 19:30:45 GMT
Sorry, just to add on top of that: it will fail under two conditions:

1. the writer has flushed something to the file, so it has a header and some data

2. the writer has flushed nothing to the file, so when I open it in vim it is completely empty

It fails in both cases with the same exception, so it looks like both are
treated as corrupted files. The question is: is there a way to skip those
individual files in the input format? My input is a folder containing many
files, some corrupted and some not, and at the very least the MR job
shouldn't fail just because of a single file.
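
To make it concrete, here is a rough sketch of the kind of wrapper I am
imagining (hypothetical class, not an existing Hadoop API): it delegates to
SequenceFileRecordReader but treats an EOFException during initialization as
"this file has zero records" instead of failing the task:

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;

// Hypothetical: like SequenceFileInputFormat, but a file whose header cannot
// be read (truncated or never flushed) is exposed as an empty split.
public class SkipCorruptSequenceFileInputFormat<K, V>
        extends SequenceFileInputFormat<K, V> {

    @Override
    public RecordReader<K, V> createRecordReader(InputSplit split,
                                                 TaskAttemptContext context) {
        final SequenceFileRecordReader<K, V> delegate =
                new SequenceFileRecordReader<K, V>();
        return new RecordReader<K, V>() {
            private boolean unreadable = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                try {
                    delegate.initialize(split, context);
                } catch (EOFException e) {
                    unreadable = true; // corrupted/unflushed file: pretend it is empty
                }
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                return !unreadable && delegate.nextKeyValue();
            }

            @Override
            public K getCurrentKey() throws IOException, InterruptedException {
                return delegate.getCurrentKey();
            }

            @Override
            public V getCurrentValue() throws IOException, InterruptedException {
                return delegate.getCurrentValue();
            }

            @Override
            public float getProgress() throws IOException, InterruptedException {
                return unreadable ? 1.0f : delegate.getProgress();
            }

            @Override
            public void close() throws IOException {
                if (!unreadable) {
                    delegate.close();
                }
            }
        };
    }
}

That class would then replace SequenceFileInputFormat.class in the
MultipleInputs.addInputPath() call quoted below.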

Thanks,
Johnny

On Thu, Jun 11, 2015 at 12:15 PM, Zhang Xiaoyu <zhangxiaoyu912@gmail.com>
wrote:

> Hi, all,
> My MR job (a consumer pipeline) uses SequenceFileInputFormat as the
> input format with MultipleInputs:
>
> for (FileStatus input : inputs) {
>     // register each listed file as an input with the same format and mapper
>     MultipleInputs.addInputPath(job, input.getPath(),
>             SequenceFileInputFormat.class, MyMapper.class);
> }
>
>
> My application fails in the following situation: the generator (using
> SequenceFile.Writer) has just created a zero-size file and keeps appending
> key-value pairs to it, but the content is not yet big enough for anything
> to be flushed to the file (not even a block has been written). If the
> consumer pipeline kicks off at this moment and consumes the file, it treats
> it as a corrupted file with this exception:
>
> java.io.EOFException: null
> at java.io.DataInputStream.readFully(DataInputStream.java:197) ~[na:1.7.0_60-ea]
> at java.io.DataInputStream.readFully(DataInputStream.java:169) ~[na:1.7.0_60-ea]
> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146) ~[hadoop-common-2.2.0.jar:na]
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) [hadoop-common-2.2.0.jar:na]
> at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1832) [hadoop-common-2.2.0.jar:na]
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1752) [hadoop-common-2.2.0.jar:na]
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773) [hadoop-common-2.2.0.jar:na]
> at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54) [hadoop-mapreduce-client-core-2.2.0.jar:na]
> at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84) [hadoop-mapreduce-client-core-2.2.0.jar:na]
> at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524) [hadoop-mapreduce-client-core-2.2.0.jar:na]
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:762) [hadoop-mapreduce-client-core-2.2.0.jar:na]
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339) [hadoop-mapreduce-client-core-2.2.0.jar:na]
> at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235) [hadoop-mapreduce-client-common-2.2.0.jar:na]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_60-ea]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_60-ea]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60-ea]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60-ea]
> at java.lang.Thread.run(Thread.java:744) [na:1.7.0_60-ea]
>
> All the reading code lives in library classes, so there is not much I can
> do from inside my MR job. Is there a way to skip a single *corrupted*
> SequenceFile?
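>
> The closest knob I have found so far is tolerating a percentage of failed
> map tasks rather than skipping bad files as such (assuming the Hadoop 2
> property name below; the old JobConf equivalent is
> setMaxMapTaskFailuresPercent):
>
> // Coarse workaround: let up to 5% of map tasks fail without failing the
> // whole job; with one small file per split that is roughly 5% of the files.
> job.getConfiguration().setInt("mapreduce.map.failures.maxpercent", 5);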
>
> Another thing: when the program fails and I then open the input file in
> vim, the file SEEMS to have a proper header (the SEQ marker, sizes, and so
> on), so I am not sure which part is corrupted. Maybe it is just timing,
> meaning that at the moment the read happened, the file didn't have the
> header yet.
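>
> If it is timing, the best mitigation I can think of (just a sketch; fs,
> inputs, job, and MyMapper are from my setup code above, with fs being the
> FileSystem the FileStatus entries were listed from) is to probe each file
> at job-setup time and only register the ones whose header reads cleanly:
>
> // Sketch: register only the files that currently open as sequence files.
> for (FileStatus input : inputs) {
>     Path p = input.getPath();
>     try {
>         SequenceFile.Reader probe =
>                 new SequenceFile.Reader(fs.getConf(), SequenceFile.Reader.file(p));
>         probe.close();
>         MultipleInputs.addInputPath(job, p,
>                 SequenceFileInputFormat.class, MyMapper.class);
>     } catch (IOException e) {
>         // header missing or truncated right now: skip this file
>         System.err.println("Skipping unreadable sequence file: " + p);
>     }
> }
>
> But of course this is racy: a file that probes fine can still be truncated
> when the map task actually reads it, so it only narrows the window.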
>
> NOT SURE whether this helps, but here is the header (plus maybe a little
> bit of content) of the "corrupted" file:
>
> SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable
> ^@^@^@^@^@^@ù<9a>ñ> <æfá#¬6<94>I­Ç^@^@^@<8c>^@^@^@%$........
>
>
> Here is an empty sequence file, which the consumer reads fine:
>
> SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable
> ^@^@^@^@^@^@<86>bÍI§ï8<97>ê=E^OÝ¢>^D
>
> Any ideas? Thanks in advance.
>
> Johnny
>
