hadoop-mapreduce-user mailing list archives

From bmdevelopment <bmdevelopm...@gmail.com>
Subject Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Date Fri, 09 Jul 2010 03:26:12 GMT
Thanks everyone.

Yes, using the Google Code version referenced on the wiki:
http://wiki.apache.org/hadoop/UsingLzoCompression

I will try the latest version and see if that fixes the problem.
http://github.com/kevinweil/hadoop-lzo

Thanks

On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <todd@cloudera.com> wrote:
> On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>> Todd fixed a bug where LZO header or block header data may fall on a read
>> boundary:
>>
>> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>>
>>
>> I am wondering if that is related to the issue you saw.
>
> I don't think this bug would show up in intermediate output compression, but
> it's certainly possible. There have been a number of bugs fixed in LZO over
> on github - are you using the github version, or the one from Google Code,
> which is out of date? Either mine or Kevin's repo on github should be a good
> version (I think we called the newest 0.3.4).
> -Todd
>
>>
>> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <bmdevelopment@gmail.com>
>> wrote:
>>>
>>> A little more on this.
>>>
>>> So, I've narrowed down the problem to using Lzop compression
>>> (com.hadoop.compression.lzo.LzopCodec)
>>> for mapred.map.output.compression.codec.
>>>
>>> <property>
>>>    <name>mapred.map.output.compression.codec</name>
>>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>>> </property>
>>>
>>> If I do the above, I will get the Shuffle Error.
>>> If I use DefaultCodec for mapred.map.output.compression.codec,
>>> there is no problem.
>>>
>>> Is this a known issue? Or is this a bug?
>>> Doesn't seem like it should be the expected behavior.
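
For what it's worth, LzopCodec writes lzop-framed output (with its own file
headers and checksums), which is generally not what the shuffle expects for
intermediate map output; the raw LzoCodec is the codec usually recommended
there, while LzopCodec is kept for final job output that needs to be readable
by lzop. A minimal sketch of that configuration, assuming the hadoop-lzo
classes are on the classpath (the property value below is a suggested change,
not something confirmed in this thread):

    <property>
       <name>mapred.compress.map.output</name>
       <value>true</value>
    </property>
    <property>
       <name>mapred.map.output.compression.codec</name>
       <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
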
>>>
>>> I would be glad to contribute any further info on this if necessary.
>>> Please let me know.
>>>
>>> Thanks
>>>
>>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment <bmdevelopment@gmail.com>
>>> wrote:
>>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>>> >
>>> > I agree that it must be a configuration problem and so today I was able
>>> > to start from scratch and did a fresh install of 0.20.2 on the 5 node
>>> > cluster.
>>> >
>>> > I've now noticed that the error occurs when compression is enabled.
>>> > I've run the basic wordcount example like so:
>>> > http://pastebin.com/wvDMZZT0
>>> > and get the Shuffle Error.
>>> >
>>> > TT logs show this error:
>>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid
>>> > header checksum: 225702cc (expected 0x2325)
>>> > Full logs:
>>> > http://pastebin.com/fVGjcGsW
>>> >
>>> > My mapred-site.xml:
>>> > http://pastebin.com/mQgMrKQw
>>> >
>>> > If I remove the compression config settings, the wordcount works fine
>>> > - no more Shuffle Error.
>>> > So I imagine I have something wrong with my compression settings.
>>> > I'll continue looking into this to see what else I can find out.
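
As a side note, the same intermediate-compression settings can also be made
per job in code instead of in mapred-site.xml, which makes it easy to flip
the codec when testing. A rough sketch against the old 0.20 mapred API
(the class and method names below, other than the JobConf calls, are
illustrative and not taken from this thread):

    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressionSetup {
      public static JobConf withMapOutputCompression(JobConf conf) {
        // Compress only the intermediate map output used by the shuffle;
        // the final job output is not affected by these two settings.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(DefaultCodec.class);
        return conf;
      }
    }
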
>>> >
>>> > Thanks a million.
>>> >
>>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala <yhemanth@gmail.com>
>>> > wrote:
>>> >> Hi,
>>> >>
>>> >> Sorry, I couldn't take a close look at the logs until now.
>>> >> Unfortunately, I could not see any huge difference between the success
>>> >> and failure cases. Can you please check whether basic hostname-to-IP
>>> >> address mapping is in place (if you have static resolution of
>>> >> hostnames set up)? A web search suggests this is the most common
>>> >> cause of this problem. Also, do the disks have enough free space?
>>> >> It would also be great if you could upload your Hadoop configuration.
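
(If static resolution is in use, the usual check is that /etc/hosts on every
node carries consistent, complete entries for all cluster members, and that
forward and reverse lookups agree with the names the TaskTrackers register
with.) The addresses and hostnames below are made up purely for illustration:

    192.168.1.10   master.cluster.local   master
    192.168.1.11   slave1.cluster.local   slave1
    192.168.1.12   slave2.cluster.local   slave2
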
>>> >>
>>> >> I do think it is very likely that configuration is the actual problem
>>> >> because it works in one case anyway.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
>>> >> <bmdevelopment@gmail.com> wrote:
>>> >>> Hello,
>>> >>> I still have had no luck with this over the past week.
>>> >>> I even get the exact same problem on a completely different
>>> >>> 5-node cluster.
>>> >>> Is it worth opening an new issue in jira for this?
>>> >>> Thanks
>>> >>>
>>> >>>
>>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
>>> >>> <bmdevelopment@gmail.com> wrote:
>>> >>>> Hello,
>>> >>>> Thanks so much for the reply.
>>> >>>> See inline.
>>> >>>>
>>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
>>> >>>> <yhemanth@gmail.com> wrote:
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>>> I've been getting the following error when trying to run a very
>>> >>>>>> simple MapReduce job.
>>> >>>>>> Map finishes without problem, but the error occurs as soon as it
>>> >>>>>> enters the Reduce phase.
>>> >>>>>>
>>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> >>>>>>
>>> >>>>>> I am running a 5 node cluster and I believe I have all my
>>> >>>>>> settings correct:
>>> >>>>>>
>>> >>>>>> * ulimit -n 32768
>>> >>>>>> * DNS/RDNS configured properly
>>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>> >>>>>>
>>> >>>>>> The program is very simple - just counts a unique string in a
>>> >>>>>> log file.
>>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>>> >>>>>>
>>> >>>>>> When I run, the job fails and I get the following output.
>>> >>>>>> http://pastebin.com/AhW6StEb
>>> >>>>>>
>>> >>>>>> However, it runs fine when I do *not* use substring() on the
>>> >>>>>> value (see map function in code above).
>>> >>>>>>
>>> >>>>>> This runs fine and completes successfully:
>>> >>>>>>            String str = val.toString();
>>> >>>>>>
>>> >>>>>> This causes error and fails:
>>> >>>>>>            String str = val.toString().substring(0,10);
>>> >>>>>>
>>> >>>>>> Please let me know if you need any further information.
>>> >>>>>> It would be greatly appreciated if anyone could shed some light
>>> >>>>>> on this problem.
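
The actual mapper is only available through the pastebin links above, but the
failing variant amounts to keying each record on a fixed-length prefix of the
input line. A rough reconstruction, using the old 0.20 mapred API with
made-up class and field names (not the poster's code):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class PrefixCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text val,
                      OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        String str = val.toString();
        // The failing job keys on the first 10 characters; clamping the
        // length avoids StringIndexOutOfBoundsException on short lines.
        word.set(str.substring(0, Math.min(10, str.length())));
        out.collect(word, ONE);
      }
    }

(The clamp is only defensive; as the rest of the thread shows, the shuffle
failure turns out to follow the compression codec, not the substring call.)
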
>>> >>>>>
>>> >>>>> It catches my attention that changing the code to use a substring
>>> >>>>> is causing a difference. Assuming it is consistent and not a red
>>> >>>>> herring,
>>> >>>>
>>> >>>> Yes, this has been consistent over the last week. I was running
>>> >>>> 0.20.1 first and then upgraded to 0.20.2, but the results have
>>> >>>> been exactly the same.
>>> >>>>
>>> >>>>> can you look at the counters for the two jobs using the JobTracker
>>> >>>>> web UI - things like map records, bytes, etc. - and see if there
>>> >>>>> is a noticeable difference?
>>> >>>>
>>> >>>> Ok, so here is the first job, using write.set(value.toString()),
>>> >>>> with *no* errors:
>>> >>>> http://pastebin.com/xvy0iGwL
>>> >>>>
>>> >>>> And here is the second job using
>>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>>> >>>> http://pastebin.com/uGw6yNqv
>>> >>>>
>>> >>>> And here is yet another where I used a longer, and therefore unique,
>>> >>>> string via write.set(value.toString().substring(0, 20)); this makes
>>> >>>> every line unique, similar to the first job.
>>> >>>> It still fails.
>>> >>>> http://pastebin.com/GdQ1rp8i
>>> >>>>
>>> >>>>> Also, are the two programs being run against
>>> >>>>> the exact same input data?
>>> >>>>
>>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>>> >>>> Using a shorter string leads to more identical keys and therefore
>>> >>>> more combining/reducing, but going by the above it seems to fail
>>> >>>> whether the substring/key is entirely unique (23000 combine output
>>> >>>> records) or mostly the same (9 combine output records).
>>> >>>>
>>> >>>>>
>>> >>>>> Also, since the cluster size is small, you could also look at the
>>> >>>>> tasktracker logs on the machines where the maps have run to see if
>>> >>>>> there are any failures when the reduce attempts start failing.
>>> >>>>
>>> >>>> Here is the TT log from the last failed job. I do not see anything
>>> >>>> besides the shuffle failure, but there
>>> >>>> may be something I am overlooking or simply do not understand.
>>> >>>> http://pastebin.com/DKFTyGXg
>>> >>>>
>>> >>>> Thanks again!
>>> >>>>
>>> >>>>>
>>> >>>>> Thanks
>>> >>>>> Hemanth
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
