hadoop-common-user mailing list archives

From Deepak Diwakar <ddeepa...@gmail.com>
Subject Re: Hadoop not splitting bzip2
Date Wed, 20 Apr 2011 05:47:47 GMT
Hi Harsh,

Thanks for the input.

Yes, when I went through the Hadoop-0.20.1 and Hadoop-0.21.0 code I got the
same impression. But since there are a lot of changes in 0.21, I thought I
would keep using 0.20.1. To use the splittable feature of bzip2, I tried
extending FileInputFormat. It appeared to work fine for bzip2 files of about
500MB, but not for bzip2 files of ~2GB, where the block size is 64MB. I think
there are a few more dependencies that I have not modified.
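
As background on why returning true from isSplitable may not be enough on its
own: a bzip2 stream is a sequence of compressed blocks, each introduced by the
48-bit magic number 0x314159265359, and blocks after the first are generally
not byte-aligned. A reader that starts at an arbitrary split offset has to scan
bit by bit for the next marker before it can decompress anything. The sketch
below illustrates only that scan, in Python with the standard bz2 module; it is
not Hadoop code, and find_block_starts and sample_payload are hypothetical
names made up for this example.

```python
import bz2
import random

# 48-bit marker that introduces every compressed block in a bzip2 stream.
BLOCK_MAGIC = 0x314159265359
MAGIC_BITS = 48


def find_block_starts(data: bytes) -> list[int]:
    """Return the bit offsets at which the block marker appears in data.

    bzip2 packs bits most-significant-bit first, so the scan slides a
    48-bit window over the stream one bit at a time.
    """
    mask = (1 << MAGIC_BITS) - 1
    window = 0
    bits_seen = 0
    starts = []
    for byte in data:
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            bits_seen += 1
            if bits_seen >= MAGIC_BITS and window == BLOCK_MAGIC:
                starts.append(bits_seen - MAGIC_BITS)
    return starts


def sample_payload(size: int, seed: int = 0) -> bytes:
    """Build deterministic pseudo-random bytes so the output spans blocks."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


if __name__ == "__main__":
    # Compression level 1 uses 100 kB blocks, so 150 kB of input
    # produces more than one block.
    compressed = bz2.compress(sample_payload(150_000), 1)
    for bit_offset in find_block_starts(compressed):
        byte_offset, bit_in_byte = divmod(bit_offset, 8)
        print(f"block at byte {byte_offset}, bit {bit_in_byte}")
```

The first block always follows the 4-byte stream header, so it sits at bit
offset 32; later blocks can start at any bit position. A record reader that
seeks to a split offset and decompresses from there without such a scan would
be handed data that does not begin at a block boundary.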

When it fails, the task is not actually reported as failed; instead the
reducer keeps retrying again and again. During a retry it actually fails in
the shuffle phase. The message is pasted below:

INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_201104102321_0019_m_000022_0
java.io.IOException: Premature EOF
        at sun.net.www.http.ChunkedInputStream.readAheadBlocking(ChunkedInputStream.java:538)
        at sun.net.www.http.ChunkedInputStream.readAhead(ChunkedInputStream.java:582)
        at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:669)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2446)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleToDisk(ReduceTask.java:1624)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1416)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

*Does anybody have a patch that can be applied to Hadoop-0.20.1 to make bzip2
splittable? It would be really helpful.*

Thanks & regards,
- Deepak Diwakar,




On 20 April 2011 00:37, Harsh J <harsh@cloudera.com> wrote:

> Hello Deepak,
>
> On Tue, Apr 19, 2011 at 9:33 PM, Deepak Diwakar <ddeepak4u@gmail.com>
> wrote:
> > Hi,
> >
> >  I am using hadoop-0.20.1
> > But when I use my own InputFormat say SafeInputFormat( extends
> > FileInputFormat ) and allow isSplitable true. It executes multiple
> > mappers, but fails when reducers reaches 33% for the large size(of order
> > of 2 GB) of bzip2 files.
>
> BZip2 splitting support was added to Apache Hadoop 0.21.0 release, and
> isn't available in the Apache Hadoop 0.20.x. Was the 0.20.1 version a
> typo?
> Also, what reason/trace does the reducer throw up when it fails?
>
> --
> Harsh J
>
