hadoop-common-user mailing list archives

From Jason Venner <ja...@attributor.com>
Subject Re: broken gzip file
Date Tue, 29 Jan 2008 22:36:08 GMT
Our change for this is mixed in with some other code we have, so I will 
have to separate it out first.
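
A rough sketch of the kind of logging described in the quoted message below
(not the actual change, which was not posted) might look like this. It
assumes the framework exposes the current split through the job properties
map.input.file, map.input.start and map.input.length; those names, the
class SplitDescription and the method describeSplit are illustrative
assumptions. The lookup function stands in for JobConf.get so that the
example runs without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: builds the line a configure() override could log.
class SplitDescription {

    // conf stands in for JobConf.get(String); the property names are assumptions.
    static String describeSplit(Function<String, String> conf) {
        String file = conf.apply("map.input.file");
        String start = conf.apply("map.input.start");
        String length = conf.apply("map.input.length");
        if (file == null) {
            return "input split: unknown";
        }
        if (start == null || length == null) {
            // No offsets available: report the file name only.
            return "input file: " + file;
        }
        return "input split: " + file + " [start=" + start + ", length=" + length + "]";
    }

    public static void main(String[] args) {
        // Made-up values, only to show the shape of the log line.
        Map<String, String> fake = new HashMap<>();
        fake.put("map.input.file", "/data/part-0001.gz");
        fake.put("map.input.start", "0");
        fake.put("map.input.length", "1024");
        System.out.println(describeSplit(fake::get));
    }
}
```

In a real job the configure(JobConf) override would presumably pass
job::get and write the result to the task log, so a failed task's log
names the input it was reading.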

Arun C Murthy wrote:
>
> On Jan 29, 2008, at 1:30 PM, Jason Venner wrote:
>
>> We have subclassed the base class (public class MapReduceBase extends 
>> org.apache.hadoop.mapred.MapReduceBase) so that the configure method 
>> logs the split name and split section (or, for gzip'd files, the file 
>> name).
>>
>> We find it very helpful to be able to map job errors to the section of 
>> the input file causing the problem.
>>
>
> Maybe we should just log it by default? Want to submit that patch?
>
> Arun
>
>>
>> Vadim Zaliva wrote:
>>> I have a bunch of gzip files which I am trying to process with a 
>>> Hadoop task. The task fails with the following exception:
>>> java.io.EOFException: Unexpected end of ZLIB input stream
>>>     at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
>>>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
>>>     at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
>>>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
>>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
>>>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
>>>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
>>> I guess some of the files are invalid. However, I could not find the 
>>> name of the file causing this exception anywhere in the logs. Due to 
>>> the huge size of the dataset, I would rather not extract the files 
>>> from DFS and verify them with gzip one by one. Any suggestions? Thanks!
>>> Sincerely,
>>> Vadim
>>>
>>>
>>
>> -- 
>> Jason Venner
>> Attributor - Publish with Confidence <http://www.attributor.com/>
>> Attributor is hiring Hadoop Wranglers, contact if interested
>
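
For the original question (finding which gzip file is truncated), one
possible approach is to stream each file through
java.util.zip.GZIPInputStream and record the name whenever decompression
fails. The sketch below uses only the JDK; the class GzipCheck and its
methods are hypothetical names, and the in-memory byte arrays stand in for
streams that would, in practice, be opened from DFS.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: reports which named gzip stream fails to decompress.
class GzipCheck {

    // Returns null if the stream decompresses cleanly, otherwise a message
    // that starts with the given name.
    static String check(String name, InputStream raw) {
        byte[] buf = new byte[64 * 1024];
        try (GZIPInputStream in = new GZIPInputStream(raw)) {
            while (in.read(buf) != -1) {
                // Discard the data; only stream integrity matters here.
            }
            return null;
        } catch (IOException e) {
            // Truncated or corrupt input surfaces as an IOException subclass
            // (for example EOFException or ZipException).
            return name + ": " + e;
        }
    }

    // Test helper: gzip-compresses a byte array in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[5000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31 % 251);
        }
        byte[] good = gzip(data);
        // Simulate a broken file by cutting the compressed bytes in half.
        byte[] bad = Arrays.copyOf(good, good.length / 2);

        System.out.println("good.gz -> " + check("good.gz", new ByteArrayInputStream(good)));
        System.out.println("bad.gz  -> " + check("bad.gz", new ByteArrayInputStream(bad)));
    }
}
```

The same check could be run inside a map task over a list of file names,
so the verification happens on the cluster rather than by copying files
out one at a time.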
