hadoop-common-user mailing list archives

From brainstorm <brainc...@gmail.com>
Subject Re: Too many fetch failures AND Shuffle error
Date Sat, 19 Jul 2008 10:05:23 GMT
Got this problem too, and fixed it just 5 minutes ago... there were
wrong IP entries on the nodes referring to the frontend, and they were
slowing down the reduce phase *a lot*... in numbers:

Wrong hosts file, wordcount example: 3 hrs, 45 min, 41 sec (4 minutes
of map; the rest, reduce)
Right hosts file, wordcount example: 6 min, 26 sec

Moral of the story: AVOID static hosts files; always use DNS.
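
For illustration only (the addresses and hostname below are invented,
not our real ones), the broken state amounted to a stale entry like
this in each compute node's /etc/hosts:

  # stale -- the frontend no longer lives at this address
  10.1.255.254  frontend.local frontend

while DNS (or a refreshed hosts file) would have returned the current
address:

  10.1.1.1      frontend.local frontend

Every transfer that hits the stale address first has to time out,
which is how hours can disappear from the reduce phase.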

PS: The static hosts files were replicated by rocksclusters to all compute
nodes at install (kickstart) time, but were not refreshed afterwards by
"rocks sync dns" or "rocks sync config".

On Fri, Jul 11, 2008 at 8:24 AM, Shengkai Zhu <geniusjash@gmail.com> wrote:
> This is also how I fixed this problem.
>
> On 6/21/08, Sayali Kulkarni <sayali_s_kulkarni@yahoo.co.in> wrote:
>>
>> Hi!
>>
>> My problem of "Too many fetch failures" as well as "shuffle error" was
>> resolved when I added the list of all the slave machines to the /etc/hosts
>> file.
>>
>> Earlier, every slave's /etc/hosts had entries only for the master and the
>> machine itself. I have now updated all the /etc/hosts files to include the
>> IP addresses and hostnames of all the machines in the cluster, and my
>> problem is resolved.
>>
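>> To make that concrete, here is a sketch of what each node's /etc/hosts
>> ends up looking like (the slave names and addresses are made up for
>> illustration; only the master IP is our real one):
>>
>>     127.0.0.1     localhost
>>     10.105.41.25  master
>>     10.105.41.26  slave1
>>     10.105.41.27  slave2
>>     10.105.41.28  slave3
>>     10.105.41.29  slave4
>>     10.105.41.30  slave5
>>
>> The same list goes on every node, so any reducer can resolve the hostname
>> of any node that ran a map.
>>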
>> One question still remains:
>> I currently have just 5-6 nodes. But when Hadoop is deployed on a larger
>> cluster, say 1000+ nodes, is it expected that every time a new machine is
>> added, you add an entry to the /etc/hosts of all the (1000+) machines
>> already in the cluster?
>>
>>
>> Regards,
>> Sayali
>>
>> Sayali Kulkarni <sayali_s_kulkarni@yahoo.co.in> wrote:
>> > Can you post the reducer logs? How many nodes are there in the cluster?
>> There are 6 nodes in the cluster - 1 master and 5 slaves.
>> I tried reducing the number of nodes, and found that the problem goes away
>> only when there is a single node in the cluster. So I deduce that the
>> problem lies somewhere in the configuration.
>>
>> Configuration file:
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>
>> <property>
>> <name>hadoop.tmp.dir</name>
>> <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
>> <description>A base for other temporary directories.</description>
>> </property>
>>
>> <property>
>> <name>fs.default.name</name>
>> <value>hdfs://10.105.41.25:54310</value>
>>   <description>The name of the default file system.  A URI whose
>> scheme and authority determine the FileSystem implementation.  The
>> uri's scheme determines the config property (fs.SCHEME.impl) naming
>> the FileSystem implementation class.  The uri's authority is used to
>> determine the host, port, etc. for a filesystem.</description>
>> </property>
>>
>> <property>
>> <name>mapred.job.tracker</name>
>> <value>10.105.41.25:54311</value>
>> <description>The host and port that the MapReduce job tracker runs
>> at.  If "local", then jobs are run in-process as a single map
>> and reduce task.
>> </description>
>> </property>
>>
>> <property>
>> <name>dfs.replication</name>
>> <value>2</value>
>> <description>Default block replication.
>> The actual number of replicas can be specified when the file is
>> created.
>> The default is used if replication is not specified at create time.
>> </description>
>> </property>
>>
>>
>> <property>
>> <name>mapred.child.java.opts</name>
>> <value>-Xmx1048M</value>
>> </property>
>>
>> <property>
>>        <name>mapred.local.dir</name>
>>        <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
>> </property>
>>
>> <property>
>> <name>mapred.map.tasks</name>
>> <value>53</value>
>> <description>The default number of map tasks per job.  Typically set
>> to a prime several times greater than number of available hosts.
>> Ignored when mapred.job.tracker is "local".
>>   </description>
>> </property>
>>
>> <property>
>> <name>mapred.reduce.tasks</name>
>> <value>7</value>
>> <description>The default number of reduce tasks per job.  Typically set
>> to a prime close to the number of available hosts.  Ignored when
>> mapred.job.tracker is "local".
>> </description>
>> </property>
>>
>> </configuration>
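>>
>> Since the fix above turned out to be name resolution, here is a quick way
>> to check for that (an illustrative Python snippet, not anything shipped
>> with Hadoop): run it on every node and compare the outputs.
>>
>>     import socket
>>
>>     # roughly the name a TaskTracker advertises for serving map output
>>     name = socket.getfqdn()
>>     # what /etc/hosts or DNS resolves that name to on this machine
>>     addr = socket.gethostbyname(name)
>>     print("%s -> %s" % (name, addr))
>>
>> If any node prints 127.0.0.1, or an address its peers cannot reach,
>> reducers on the other nodes will fail their fetches exactly as in the
>> log below.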
>>
>>
>> ============
>> This is the output that I get when running the tasks with 2 nodes in the
>> cluster:
>>
>> 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to process
>> : 1
>> 08/06/20 11:07:45 INFO mapred.JobClient: Running job: job_200806201106_0001
>> 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
>> 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
>> 08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
>> 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
>> 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
>> 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
>> 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
>> 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
>> 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
>> 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
>> 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
>> 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
>> 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
>> 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
>> 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
>> 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
>> 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
>> 08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:09:54 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000002_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:09:56 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000003_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000011_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:09:57 INFO mapred.JobClient:  map 95% reduce 9%
>> 08/06/20 11:09:59 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/06/20 11:10:04 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/06/20 11:10:07 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:10:09 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:12 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:10:14 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/06/20 11:10:17 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:10:24 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:24 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:10:29 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000001_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000003_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:10:32 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/06/20 11:10:37 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:42 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:10:47 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:10:52 INFO mapred.JobClient:  map 95% reduce 16%
>> 08/06/20 11:10:52 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000020_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:10:54 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:11:02 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000017_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:09 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:24 INFO mapred.JobClient:  map 95% reduce 17%
>> 08/06/20 11:11:24 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000007_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:27 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:32 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000012_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:34 INFO mapred.JobClient:  map 95% reduce 17%
>> 08/06/20 11:11:34 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000019_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:39 INFO mapred.JobClient:  map 91% reduce 18%
>> 08/06/20 11:11:39 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000002_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:41 INFO mapred.JobClient:  map 95% reduce 18%
>> 08/06/20 11:11:42 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:11:42 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000006_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:44 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:44 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_r_000003_1, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:11:51 INFO mapred.JobClient:  map 100% reduce 18%
>> 08/06/20 11:11:54 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:11:59 INFO mapred.JobClient:  map 95% reduce 19%
>> 08/06/20 11:11:59 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000010_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:12:02 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:12:07 INFO mapred.JobClient:  map 100% reduce 20%
>> 08/06/20 11:12:08 INFO mapred.JobClient:  map 100% reduce 33%
>> 08/06/20 11:12:09 INFO mapred.JobClient:  map 100% reduce 47%
>> 08/06/20 11:12:11 INFO mapred.JobClient:  map 100% reduce 60%
>> 08/06/20 11:12:16 INFO mapred.JobClient:  map 100% reduce 62%
>> 08/06/20 11:12:24 INFO mapred.JobClient:  map 100% reduce 63%
>> 08/06/20 11:12:26 INFO mapred.JobClient:  map 100% reduce 64%
>> 08/06/20 11:12:31 INFO mapred.JobClient:  map 100% reduce 65%
>> 08/06/20 11:12:31 INFO mapred.JobClient: Task Id :
>> task_200806201106_0001_m_000019_1, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:12:36 INFO mapred.JobClient:  map 100% reduce 66%
>> 08/06/20 11:12:38 INFO mapred.JobClient:  map 100% reduce 67%
>> 08/06/20 11:12:39 INFO mapred.JobClient:  map 100% reduce 80%
>>
>> ===============
>>
>> > Are you seeing this for all the maps and reducers?
>> Yes, this happens on all the maps and reducers. I tried keeping just 2
>> nodes in the cluster, but the problem still exists.
>>
>> > Are the reducers progressing at all?
>> The reducers continue to execute up to a certain point, but after that
>> they do not proceed at all; they stall at around 16% on average.
>>
>> > Are all the maps that the reducer is failing from a remote machine?
>> Yes.
>>
>> > Are all the failed maps/reducers from the same machine?
>> All the maps and reducers are failing anyway.
>>
>> Thanks for the help in advance,
>>
>> Regards,
>> Sayali
>>
>
