hadoop-mapreduce-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Date Thu, 08 Jul 2010 17:38:15 GMT
Todd fixed a bug where LZO header or block header data may fall on a read boundary:
http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58

I am wondering if that is related to the issue you saw.
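
If it is related, the usual failure mode with that class of bug is a short read:
code that assumes a single read() call returns a complete header breaks when the
header straddles a read boundary. A minimal illustrative sketch of the defensive
pattern (generic Java, not the actual hadoop-lzo code):

import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyExample {
    // Read exactly buf.length bytes, looping because a single read()
    // may legally return fewer bytes than requested.
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException("stream ended inside header");
            }
            off += n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] header = new byte[9]; // e.g. a fixed-size magic/header block
        readFully(new ByteArrayInputStream(new byte[16]), header);
        System.out.println("read " + header.length + " header bytes");
    }
}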

On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bmdevelopment@gmail.com> wrote:

> A little more on this.
>
> So, I've narrowed down the problem to using Lzop compression
> (com.hadoop.compression.lzo.LzopCodec)
> for mapred.map.output.compression.codec.
>
> <property>
>    <name>mapred.map.output.compression.codec</name>
>    <value>com.hadoop.compression.lzo.LzopCodec</value>
> </property>
>
> If I do the above, I will get the Shuffle Error.
> If I use DefaultCodec for mapred.map.output.compression.codec,
> there is no problem.
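>
> For reference, this is roughly what I mean by the DefaultCodec setup that
> works (a sketch only - property names are the 0.20 ones, values illustrative):
>
> <property>
>    <name>mapred.compress.map.output</name>
>    <value>true</value>
> </property>
> <property>
>    <name>mapred.map.output.compression.codec</name>
>    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
> </property>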
>
> Is this a known issue? Or is this a bug?
> Doesn't seem like it should be the expected behavior.
>
> I would be glad to contribute any further info on this if necessary.
> Please let me know.
>
> Thanks
>
> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bmdevelopment@gmail.com>
> wrote:
> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
> >
> > I agree that it must be a configuration problem, and so today I started
> > from scratch and did a fresh install of 0.20.2 on the 5-node cluster.
> >
> > I've now noticed that the error occurs when compression is enabled.
> > I've run the basic wordcount example like so:
> > http://pastebin.com/wvDMZZT0
> > and get the Shuffle Error.
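> >
> > Roughly, the invocation looks like this (a sketch - the jar name and paths
> > are placeholders for my actual ones, and the compression settings can
> > equally live in mapred-site.xml instead of being passed as -D options):
> >
> > hadoop jar hadoop-0.20.2-examples.jar wordcount \
> >   -Dmapred.compress.map.output=true \
> >   -Dmapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
> >   /user/hadoop/input /user/hadoop/output-wc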
> >
> > TT logs show this error:
> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
> > header checksum: 225702cc (expected 0x2325)
> > Full logs:
> > http://pastebin.com/fVGjcGsW
> >
> > My mapred-site.xml:
> > http://pastebin.com/mQgMrKQw
> >
> > If I remove the compression config settings, the wordcount works fine
> > - no more Shuffle Error.
> > So I imagine I have something wrong with my compression settings.
> > I'll continue looking into this to see what else I can find out.
> >
> > Thanks a million.
> >
> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yhemanth@gmail.com> wrote:
> >> Hi,
> >>
> >> Sorry, I couldn't take a close look at the logs until now.
> >> Unfortunately, I could not see any huge difference between the success
> >> and failure case. Can you please check whether basic things like
> >> hostname-to-IP-address mapping are in place (if you have static
> >> resolution of hostnames set up)? A web search suggests this is the most
> >> common cause users have hit with this problem. Also, do the disks have
> >> enough space? It would also be great if you could upload your Hadoop
> >> configuration.
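> >>
> >> If you do use static resolution, one quick check is that /etc/hosts on
> >> every node carries the same forward entries for all nodes (and that
> >> reverse lookups agree), along these lines - the names and addresses
> >> below are just placeholders:
> >>
> >> 192.168.1.10   master.cluster.local   master
> >> 192.168.1.11   slave1.cluster.local   slave1
> >> 192.168.1.12   slave2.cluster.local   slave2
> >> # ...and so on for the remaining nodes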
> >>
> >> I do think it is very likely that configuration is the actual problem
> >> because it works in one case anyway.
> >>
> >> Thanks
> >> Hemanth
> >>
> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <bmdevelopment@gmail.com> wrote:
> >>> Hello,
> >>> I still have had no luck with this over the past week.
> >>> And I even get the exact same problem on a completely different 5-node cluster.
> >>> Is it worth opening a new issue in JIRA for this?
> >>> Thanks
> >>>
> >>>
> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <bmdevelopment@gmail.com> wrote:
> >>>> Hello,
> >>>> Thanks so much for the reply.
> >>>> See inline.
> >>>>
> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yhemanth@gmail.com> wrote:
> >>>>> Hi,
> >>>>>
> >>>>>> I've been getting the following error when trying to run a very simple
> >>>>>> MapReduce job. Map finishes without a problem, but the error occurs as
> >>>>>> soon as it enters the Reduce phase.
> >>>>>>
> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>>>>>
> >>>>>> I am running a 5-node cluster and I believe I have all my settings correct:
> >>>>>>
> >>>>>> * ulimit -n 32768
> >>>>>> * DNS/RDNS configured properly
> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>>>>>
> >>>>>> The program is very simple - just counts a unique string in a log file.
> >>>>>> See here: http://pastebin.com/5uRG3SFL
> >>>>>>
> >>>>>> When I run, the job fails and I get the following output.
> >>>>>> http://pastebin.com/AhW6StEb
> >>>>>>
> >>>>>> However, it runs fine when I do *not* use substring() on the value (see
> >>>>>> the map function in the code above).
> >>>>>>
> >>>>>> This runs fine and completes successfully:
> >>>>>>            String str = val.toString();
> >>>>>>
> >>>>>> This causes error and fails:
> >>>>>>            String str = val.toString().substring(0,10);
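> >>>>>>
> >>>>>> For reference, the map function is essentially a wordcount-style mapper
> >>>>>> like the sketch below (reconstructed from memory, not the exact pastebin
> >>>>>> code; the class and field names are placeholders):
> >>>>>>
> >>>>>> import java.io.IOException;
> >>>>>> import org.apache.hadoop.io.IntWritable;
> >>>>>> import org.apache.hadoop.io.LongWritable;
> >>>>>> import org.apache.hadoop.io.Text;
> >>>>>> import org.apache.hadoop.mapreduce.Mapper;
> >>>>>>
> >>>>>> public class PrefixCountMapper
> >>>>>>         extends Mapper<LongWritable, Text, Text, IntWritable> {
> >>>>>>     private final static IntWritable one = new IntWritable(1);
> >>>>>>     private final Text word = new Text();
> >>>>>>
> >>>>>>     public void map(LongWritable key, Text val, Context context)
> >>>>>>             throws IOException, InterruptedException {
> >>>>>>         // works fine: String str = val.toString();
> >>>>>>         // fails with the Shuffle Error when map output compression is on:
> >>>>>>         String str = val.toString().substring(0, 10);
> >>>>>>         word.set(str);
> >>>>>>         context.write(word, one);  // count occurrences of each key
> >>>>>>     }
> >>>>>> }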
> >>>>>>
> >>>>>> Please let me know if you need any further information.
> >>>>>> It would be greatly appreciated if anyone could shed some light on this problem.
> >>>>>
> >>>>> It is striking that changing the code to use a substring makes a
> >>>>> difference. Assuming it is consistent and not a red herring,
> >>>>
> >>>> Yes, this has been consistent over the last week. I was running 0.20.1
> >>>> first and then upgraded to 0.20.2, but the results have been exactly the same.
> >>>>
> >>>>> can you look at the counters for the two jobs using the JobTracker web
> >>>>> UI - things like map records, bytes, etc. - and see if there is a
> >>>>> noticeable difference?
> >>>>
> >>>> Ok, so here is the first job using write.set(value.toString()); having
> >>>> *no* errors:
> >>>> http://pastebin.com/xvy0iGwL
> >>>>
> >>>> And here is the second job using
> >>>> write.set(value.toString().substring(0, 10)); that fails:
> >>>> http://pastebin.com/uGw6yNqv
> >>>>
> >>>> And here is yet another where I used a longer, and therefore unique,
> >>>> string, via write.set(value.toString().substring(0, 20)); this makes
> >>>> every line unique, similar to the first job.
> >>>> Still fails.
> >>>> http://pastebin.com/GdQ1rp8i
> >>>>
> >>>>> Also, are the two programs being run against the exact same input data?
> >>>>
> >>>> Yes, exactly the same input: a single CSV file with 23K lines.
> >>>> Using a shorter string leads to more duplicate keys and therefore more
> >>>> combining/reducing, but going by the above it seems to fail whether the
> >>>> substring/key is entirely unique (23000 combine output records) or
> >>>> mostly the same (9 combine output records).
> >>>>
> >>>>>
> >>>>> Also, since the cluster size is small, you could look at the tasktracker
> >>>>> logs on the machines where the maps have run to see if there are any
> >>>>> failures when the reduce attempts start failing.
> >>>>
> >>>> Here is the TT log from the last failed job. I do not see anything
> >>>> besides the shuffle failure, but there
> >>>> may be something I am overlooking or simply do not understand.
> >>>> http://pastebin.com/DKFTyGXg
> >>>>
> >>>> Thanks again!
> >>>>
> >>>>>
> >>>>> Thanks
> >>>>> Hemanth
> >>>>>
> >>>>
> >>>
> >>
> >
>
