hadoop-common-user mailing list archives

From "Shengkai Zhu" <geniusj...@gmail.com>
Subject Re: Too many fetch failures AND Shuffle error
Date Fri, 11 Jul 2008 06:24:02 GMT
This is also how I fixed this problem.
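For anyone hitting the same errors, the fix described below amounts to listing every node of the cluster, by IP address and hostname, in /etc/hosts on every machine. A minimal sketch, assuming a master and two slaves (the slave addresses and all hostnames here are made up for illustration; only the master IP matches the configuration quoted below):

    127.0.0.1       localhost
    10.105.41.25    hadoop-master
    10.105.41.26    hadoop-slave1
    10.105.41.27    hadoop-slave2

Once every node can resolve every other node's hostname, the reduce tasks can fetch map output from the remote TaskTrackers, which is what the MAX_FAILED_UNIQUE_FETCHES errors in the log below indicate was failing.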

On 6/21/08, Sayali Kulkarni <sayali_s_kulkarni@yahoo.co.in> wrote:
>
> Hi!
>
> My problem of "Too many fetch failures" as well as "Shuffle error" was
> resolved when I added the list of all the slave machines to the /etc/hosts
> file.
>
> Earlier, every slave's /etc/hosts file had entries only for the master and the
> machine itself. I have now updated all the /etc/hosts files to
> include the IP addresses and names of all the machines in the cluster, and
> my problem is resolved.
>
> One question remains, though:
> I currently have just 5-6 nodes. But when Hadoop is deployed on a larger
> cluster, say 1000+ nodes, is it expected that every time a new machine is
> added to the cluster, you add an entry to the /etc/hosts file of all the (1000+)
> machines in the cluster?
>
>
> Regards,
> Sayali
>
> Sayali Kulkarni <sayali_s_kulkarni@yahoo.co.in> wrote:
> > Can you post the reducer logs? How many nodes are there in the cluster?
> There are 6 nodes in the cluster - 1 master and 5 slaves
> I tried reducing the number of nodes, and found that the problem goes away
> only if there is a single node in the cluster. So I can deduce that the
> problem lies somewhere in the configuration.
>
> Configuration file:
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
> <name>hadoop.tmp.dir</name>
> <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
> <description>A base for other temporary directories.</description>
> </property>
>
> <property>
> <name>fs.default.name</name>
> <value>hdfs://10.105.41.25:54310</value>
>   <description>The name of the default file system.  A URI whose
> scheme and authority determine the FileSystem implementation.  The
> uri's scheme determines the config property (fs.SCHEME.impl) naming
> the FileSystem implementation class.  The uri's authority is used to
> determine the host, port, etc. for a filesystem.</description>
> </property>
>
> <property>
> <name>mapred.job.tracker</name>
> <value>10.105.41.25:54311</value>
> <description>The host and port that the MapReduce job tracker runs
> at.  If "local", then jobs are run in-process as a single map
> and reduce task.
> </description>
> </property>
>
> <property>
> <name>dfs.replication</name>
> <value>2</value>
> <description>Default block replication.
> The actual  number of replications can be specified when the file is
> created.
> The default is used if replication is not specified in create time.
> </description>
> </property>
>
>
> <property>
> <name>mapred.child.java.opts</name>
> <value>-Xmx1048M</value>
> </property>
>
> <property>
>        <name>mapred.local.dir</name>
>        <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
> </property>
>
> <property>
> <name>mapred.map.tasks</name>
> <value>53</value>
> <description>The default number of map tasks per job.  Typically set
> to a prime several times greater than number of available hosts.
> Ignored when mapred.job.tracker is "local".
>   </description>
> </property>
>
> <property>
> <name>mapred.reduce.tasks</name>
> <value>7</value>
> <description>The default number of reduce tasks per job.  Typically set
> to a prime close to the number of available hosts.  Ignored when
> mapred.job.tracker is "local".
> </description>
> </property>
>
> </configuration>
>
>
> ============
> This is the output that I get when running the tasks with 2 nodes in the
> cluster:
>
> 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to process
> : 1
> 08/06/20 11:07:45 INFO mapred.JobClient: Running job: job_200806201106_0001
> 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
> 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
> 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
> 08/06/20 11:07:57 INFO mapred.JobClient:  map  26% reduce 0%
> 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
> 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
> 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
> 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
> 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
> 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
> 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
> 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
> 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
> 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
> 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
> 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
> 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
> 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
> 08/06/20  11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
> 08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
> 08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
> 08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
> 08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
> 08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
> 08/06/20 11:09:54 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_r_000002_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:09:56 INFO mapred.JobClient:  map 100% reduce 9%
> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_r_000003_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000011_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:09:57 INFO  mapred.JobClient:  map 95% reduce 9%
> 08/06/20 11:09:59 INFO mapred.JobClient:  map 100% reduce 9%
> 08/06/20 11:10:04 INFO mapred.JobClient:  map 100% reduce 10%
> 08/06/20 11:10:07 INFO mapred.JobClient:  map 100% reduce 11%
> 08/06/20 11:10:09 INFO mapred.JobClient:  map 100% reduce 13%
> 08/06/20 11:10:12 INFO mapred.JobClient:  map 100% reduce 14%
> 08/06/20 11:10:14 INFO mapred.JobClient:  map 100% reduce 15%
> 08/06/20 11:10:17 INFO mapred.JobClient:  map 100% reduce 16%
> 08/06/20 11:10:24 INFO mapred.JobClient:  map 100% reduce 13%
> 08/06/20 11:10:24 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_r_000000_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:10:29 INFO mapred.JobClient:  map 100% reduce 11%
> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_r_000001_0, Status : FAILED
> Shuffle Error: Exceeded  MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000003_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:10:32 INFO mapred.JobClient:  map 100% reduce 12%
> 08/06/20 11:10:37 INFO mapred.JobClient:  map 100% reduce 13%
> 08/06/20 11:10:42 INFO mapred.JobClient:  map 100% reduce 14%
> 08/06/20 11:10:47 INFO mapred.JobClient:  map 100% reduce 16%
> 08/06/20 11:10:52 INFO mapred.JobClient:  map 95% reduce 16%
> 08/06/20 11:10:52 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000020_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:10:54 INFO mapred.JobClient:  map 100% reduce 16%
> 08/06/20 11:11:02 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000017_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:09 INFO mapred.JobClient:  map 100% reduce 17%
> 08/06/20 11:11:24 INFO mapred.JobClient:  map 95%  reduce 17%
> 08/06/20 11:11:24 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000007_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:27 INFO mapred.JobClient:  map 100% reduce 17%
> 08/06/20 11:11:32 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000012_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:34 INFO mapred.JobClient:  map 95% reduce 17%
> 08/06/20 11:11:34 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000019_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:39 INFO mapred.JobClient:  map 91% reduce 18%
> 08/06/20 11:11:39 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000002_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:11:41 INFO mapred.JobClient:  map 95% reduce 18%
> 08/06/20 11:11:42 INFO mapred.JobClient:  map 100% reduce 19%
> 08/06/20 11:11:42 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000006_0, Status :  FAILED
> Too many fetch-failures
> 08/06/20 11:11:44 INFO mapred.JobClient:  map 100% reduce 17%
> 08/06/20 11:11:44 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_r_000003_1, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/06/20 11:11:51 INFO mapred.JobClient:  map 100% reduce 18%
> 08/06/20 11:11:54 INFO mapred.JobClient:  map 100% reduce 19%
> 08/06/20 11:11:59 INFO mapred.JobClient:  map 95% reduce 19%
> 08/06/20 11:11:59 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000010_0, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:12:02 INFO mapred.JobClient:  map 100% reduce 19%
> 08/06/20 11:12:07 INFO mapred.JobClient:  map 100% reduce 20%
> 08/06/20 11:12:08 INFO mapred.JobClient:  map 100% reduce 33%
> 08/06/20 11:12:09 INFO mapred.JobClient:  map 100% reduce 47%
> 08/06/20 11:12:11 INFO mapred.JobClient:  map 100% reduce 60%
> 08/06/20 11:12:16  INFO mapred.JobClient:  map 100% reduce 62%
> 08/06/20 11:12:24 INFO mapred.JobClient:  map 100% reduce 63%
> 08/06/20 11:12:26 INFO mapred.JobClient:  map 100% reduce 64%
> 08/06/20 11:12:31 INFO mapred.JobClient:  map 100% reduce 65%
> 08/06/20 11:12:31 INFO mapred.JobClient: Task Id :
> task_200806201106_0001_m_000019_1, Status : FAILED
> Too many fetch-failures
> 08/06/20 11:12:36 INFO mapred.JobClient:  map 100% reduce 66%
> 08/06/20 11:12:38 INFO mapred.JobClient:  map 100% reduce 67%
> 08/06/20 11:12:39 INFO mapred.JobClient:  map 100% reduce 80%
>
> ===============
>
> > Are you seeing this for all the maps and reducers?
> Yes, this happens on all the maps and reducers. I tried keeping just 2
> nodes in the cluster, but the problem still exists.
>
> > Are the reducers progressing at all?
> The reducers continue to execute up to a certain point, but after that they
> do not proceed at all. They stop at an average of 16%.
>
> > Are all the maps that the reducer is failing from a remote machine?
> Yes.
>
> > Are all the failed maps/reducers from the same machine?
> All the maps and reducers are failing anyway.
>
> Thanks for the help in advance,
>
> Regards,
> Sayali
>
