Subject: Re: Reducers are stuck fetching map data.
From: Suhail Rehman <suhailrehman@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Tue, 26 Jan 2010 22:05:56 +0300

Yes, that will be immensely helpful for others.

Suhail

On Tue, Jan 26, 2010 at 9:52 PM, Jean-Daniel Cryans wrote:
> You mean that documentation?
>
> http://hadoop.apache.org/common/docs/r0.20.1/quickstart.html#Required+Software
>
> J-D
>
> On Tue, Jan 26, 2010 at 1:34 AM, Suhail Rehman wrote:
> > We finally figured it out! The problem was with the JDK installation on our
> > VMs: they were configured to use the IBM JDK, and the moment we switched to
> > the Sun JDK, everything works flawlessly. (A quick way to check which JVM is
> > actually in use is sketched after the thread below.)
> >
> > You may want to note somewhere in the documentation that the Sun JDK is
> > strongly recommended for use with Hadoop.
> >
> > Suhail
> >
> > On Thu, Jan 21, 2010 at 1:13 PM, Suhail Rehman wrote:
> >>
> >> We have verified that it does NOT solve the problem at all. This would lead
> >> us to believe that the timeout issue we are experiencing is not part of the
> >> shuffle phase. Any other ideas that might help us?
> >>
> >> The TaskTracker logs show that these reducers are stuck during the copy
> >> phase.
> >>
> >> Suhail
> >>
> >> On Wed, Jan 20, 2010 at 5:22 PM, Amareshwari Sri Ramadasu wrote:
> >>>
> >>> Read timeouts are found to be costly during the shuffle if the map runtime
> >>> is high. Please see HADOOP-3327
> >>> (http://issues.apache.org/jira/browse/HADOOP-3327) for the shuffle
> >>> improvements done specifically for read timeouts. (A sketch of the
> >>> 0.20-era tuning knobs follows the thread below.)
> >>>
> >>> Thanks,
> >>> Amareshwari
> >>>
> >>> On 1/20/10 6:07 PM, "Suhail Rehman" wrote:
> >>>
> >>> We are having trouble running Hadoop MapReduce jobs on our cluster.
> >>>
> >>> The VMs run on an IBM BladeCenter with the following virtualized
> >>> configuration:
> >>>
> >>> Master Node/NameNode: 1x
> >>>   OS: Xen, Red Hat Linux 5.2; CPU: 3 vCPUs; RAM: 1024 MB
> >>> Slaves/DataNodes: 3x
> >>>   OS: Xen, Red Hat Linux 5.2; CPU: 1 vCPU; RAM: 1024 MB
> >>>
> >>> We are working with the standard Hadoop example code, on Hadoop 0.20.1
> >>> (stable, with the latest patches installed). All VMs have their firewalls
> >>> turned off and SELinux disabled.
> >>>
> >>> For example, when we run the "wordcount" program on a provisioned
> >>> cluster, the map operations complete successfully, but the program gets
> >>> stuck trying to complete the reduce operations.
> >>>
> >>> On examining the logs, we find that the reducers are waiting for the
> >>> outputs of map operations on other nodes. Our understanding is that this
> >>> communication happens over HTTP, and these provisioned VMs seem to have
> >>> trouble communicating on the ports that Hadoop uses.
> >>>
> >>> Also, when we try to access the JobTracker web interface to view the
> >>> running jobs, the machine takes a very long time to respond to our
> >>> queries. Since both the reducer communication and the JobTracker web
> >>> interface work over HTTP, we think the problem might be a networking
> >>> issue or a problem with the built-in HTTP service in Hadoop (Jetty).
> >>> (How such a read timeout surfaces on an HTTP fetch is illustrated after
> >>> the thread below.)
> >>>
> >>> Attached is a partial task log from one of the reducers;
> >>> "WARN org.apache.hadoop.mapred.ReduceTask:
> >>> java.net.SocketTimeoutException: Read timed out"
> >>> appears on all the reducers, and eventually the job either fails to
> >>> complete or takes a very long time (about 15 hours to process an 11 GB
> >>> text file).
> >>>
> >>> The problem seems to be random: at times the program runs successfully in
> >>> about 20 minutes, at other times it takes 15 hours.
> >>>
> >>> Any help with this would be much appreciated.
> >>>
> >>> Regards,
> >>>
> >>> Suhail Rehman
> >>> MS by Research in Computer Science
> >>> International Institute of Information Technology - Hyderabad
> >>> rehman@research.iiit.ac.in
> >>> ---------------------------------------------------------------------
> >>> http://research.iiit.ac.in/~rehman

--
Regards,

Suhail Rehman
MS by Research in Computer Science
International Institute of Information Technology - Hyderabad
rehman@research.iiit.ac.in
---------------------------------------------------------------------
http://research.iiit.ac.in/~rehman
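
A note on the JDK fix above: a quick way to confirm which JVM Hadoop is
actually launching is to compile and run a tiny class with the same
JAVA_HOME that hadoop-env.sh points at. This is a minimal sketch using only
standard java.lang.System properties; the class name is illustrative.

    // CheckJvm.java -- compile with javac, then run with the JVM that
    // JAVA_HOME selects, e.g.: $JAVA_HOME/bin/java CheckJvm
    public class CheckJvm {
        public static void main(String[] args) {
            // The Sun JDK reports "Sun Microsystems Inc." for java.vendor;
            // the IBM JDK reports "IBM Corporation".
            System.out.println("java.vendor  = " + System.getProperty("java.vendor"));
            System.out.println("java.vm.name = " + System.getProperty("java.vm.name"));
            System.out.println("java.version = " + System.getProperty("java.version"));
        }
    }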
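
On the read-timeout discussion: short of upgrading to a release that carries
the HADOOP-3327 improvements, the 0.20-era knobs most often adjusted are
mapred.reduce.parallel.copies and mapred.task.timeout, both present in
0.20's mapred-default.xml. Below is a hedged sketch of setting them
programmatically with the old mapred API; the class name and the values are
illustrative, not recommendations.

    import org.apache.hadoop.mapred.JobConf;

    // Sketch (old 0.20 mapred API): apply shuffle-related tuning to a job
    // configuration before submission. Property names come from 0.20's
    // mapred-default.xml; the values here are illustrative only.
    public final class ShuffleTuning {
        private ShuffleTuning() {}

        public static void apply(JobConf conf) {
            // Parallel map-output fetch threads per reducer (0.20 default: 5).
            // Lowering this reduces concurrent load on a slow TaskTracker
            // HTTP (Jetty) server.
            conf.setInt("mapred.reduce.parallel.copies", 2);

            // Milliseconds before an unresponsive task is declared failed
            // (0.20 default: 600000, i.e. 10 minutes).
            conf.setInt("mapred.task.timeout", 1200000);
        }
    }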
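
For readers wondering where "java.net.SocketTimeoutException: Read timed
out" comes from: each reducer fetches map output over HTTP from the
TaskTrackers' embedded Jetty servers, and the fetch fails this way when the
server accepts the connection but then stalls mid-response. The following
is not Hadoop's shuffle code, just a plain-Java sketch of the same failure
mode; the URL argument is whatever endpoint you want to probe, and the
timeout values are arbitrary.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // TimedFetch.java -- fetch a URL with connect/read timeouts set. If the
    // server stalls longer than the read timeout, in.read() throws
    // java.net.SocketTimeoutException ("Read timed out"), the same message
    // seen in the reducer logs.
    public class TimedFetch {
        public static void main(String[] args) throws Exception {
            URL url = new URL(args[0]);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(30000); // fail fast if the port is unreachable
            conn.setReadTimeout(60000);    // throws SocketTimeoutException on a stall
            long total = 0;
            InputStream in = conn.getInputStream();
            try {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    total += n;
                }
            } finally {
                in.close();
                conn.disconnect();
            }
            System.out.println("Fetched " + total + " bytes from " + url);
        }
    }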