Subject: Re: Reducers are stuck fetching map data.
From: Suhail Rehman <suhailrehman@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Tue, 26 Jan 2010 22:05:56 +0300

Yes, that will be immensely helpful for others.

Suhail

On Tue, Jan 26, 2010 at 9:52 PM, Jean-Daniel Cryans wrote:
> You mean that documentation?
>
> http://hadoop.apache.org/common/docs/r0.20.1/quickstart.html#Required+Software
>
> J-D
>
> On Tue, Jan 26, 2010 at 1:34 AM, Suhail Rehman wrote:
> > We finally figured it out! The problem was with the JDK installation on our
> > VMs: they were configured to use the IBM JDK, and the moment we switched to
> > the Sun JDK, everything works flawlessly. (A quick way to check which JVM is
> > actually in use is sketched after the thread below.)
> >
> > You may want to note somewhere in the documentation that the Sun JDK is
> > strongly recommended for use with Hadoop.
> >
> > Suhail
> >
> > On Thu, Jan 21, 2010 at 1:13 PM, Suhail Rehman wrote:
> >>
> >> We have verified that it does NOT solve the problem at all. This would lead
> >> us to believe that the timeout issue we are experiencing is not part of the
> >> shuffle phase. Any other ideas that might help us?
> >>
> >> The TaskTracker logs show that these reducers are stuck during the copy
> >> phase.
> >>
> >> Suhail
> >>
> >> On Wed, Jan 20, 2010 at 5:22 PM, Amareshwari Sri Ramadasu wrote:
> >>>
> >>> Read timeouts are found to be costly during the shuffle if the map runtime
> >>> is high. Please see HADOOP-3327
> >>> (http://issues.apache.org/jira/browse/HADOOP-3327) for the shuffle
> >>> improvements done specifically for read timeouts. (A sketch of the
> >>> 0.20-era tuning knobs follows the thread below.)
> >>>
> >>> Thanks,
> >>> Amareshwari
> >>>
> >>> On 1/20/10 6:07 PM, "Suhail Rehman" wrote:
> >>>
> >>> We are having trouble running Hadoop MapReduce jobs on our cluster.
> >>>
> >>> The VMs run on an IBM BladeCenter with the following virtualized
> >>> configuration:
> >>>
> >>> Master Node/NameNode: 1x
> >>>   OS: Xen, Red Hat Linux 5.2; CPU: 3 vCPUs; RAM: 1024 MB
> >>> Slaves/DataNodes: 3x
> >>>   OS: Xen, Red Hat Linux 5.2; CPU: 1 vCPU; RAM: 1024 MB
> >>>
> >>> We are working with the standard Hadoop example code, on Hadoop 0.20.1
> >>> (stable, with the latest patches installed). All VMs have their firewalls
> >>> turned off and SELinux disabled.
> >>>
> >>> For example, when we run the "wordcount" program on a provisioned
> >>> cluster, the map operations complete successfully, but the program gets
> >>> stuck trying to complete the reduce operations.
> >>>
> >>> On examining the logs, we find that the reducers are waiting for the
> >>> outputs of map operations on other nodes. Our understanding is that this
> >>> communication happens over HTTP, and these provisioned VMs seem to have
> >>> trouble communicating on the ports that Hadoop uses.
> >>>
> >>> Also, when we try to access the JobTracker web interface to view the
> >>> running jobs, the machine takes a very long time to respond to our
> >>> queries. Since both the reducer communication and the JobTracker web
> >>> interface work over HTTP, we think the problem might be a networking
> >>> issue or a problem with the built-in HTTP service in Hadoop (Jetty).
> >>> (How such a read timeout surfaces on an HTTP fetch is illustrated after
> >>> the thread below.)
> >>>
> >>> Attached is a partial task log from one of the reducers;
> >>> "WARN org.apache.hadoop.mapred.ReduceTask:
> >>> java.net.SocketTimeoutException: Read timed out"
> >>> appears on all the reducers, and eventually the job either fails to
> >>> complete or takes a very long time (about 15 hours to process an 11 GB
> >>> text file).
> >>>
> >>> The problem seems to be random: at times the program runs successfully in
> >>> about 20 minutes, at other times it takes 15 hours.
> >>>
> >>> Any help with this would be much appreciated.
> >>>
> >>> Regards,
> >>>
> >>> Suhail Rehman
> >>> MS by Research in Computer Science
> >>> International Institute of Information Technology - Hyderabad
> >>> rehman@research.iiit.ac.in
> >>> ---------------------------------------------------------------------
> >>> http://research.iiit.ac.in/~rehman

--
Regards,

Suhail Rehman
MS by Research in Computer Science
International Institute of Information Technology - Hyderabad
rehman@research.iiit.ac.in
---------------------------------------------------------------------
http://research.iiit.ac.in/~rehman
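
A note on the JDK fix above: a quick way to confirm which JVM Hadoop is
actually launching is to compile and run a tiny class with the same
JAVA_HOME that hadoop-env.sh points at. This is a minimal sketch using only
standard java.lang.System properties; the class name is illustrative.

    // CheckJvm.java -- compile with javac, then run with the JVM that
    // JAVA_HOME selects, e.g.: $JAVA_HOME/bin/java CheckJvm
    public class CheckJvm {
        public static void main(String[] args) {
            // The Sun JDK reports "Sun Microsystems Inc." for java.vendor;
            // the IBM JDK reports "IBM Corporation".
            System.out.println("java.vendor  = " + System.getProperty("java.vendor"));
            System.out.println("java.vm.name = " + System.getProperty("java.vm.name"));
            System.out.println("java.version = " + System.getProperty("java.version"));
        }
    }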
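
On the read-timeout discussion: short of upgrading to a release that carries
the HADOOP-3327 improvements, the 0.20-era knobs most often adjusted are
mapred.reduce.parallel.copies and mapred.task.timeout, both present in
0.20's mapred-default.xml. Below is a hedged sketch of setting them
programmatically with the old mapred API; the class name and the values are
illustrative, not recommendations.

    import org.apache.hadoop.mapred.JobConf;

    // Sketch (old 0.20 mapred API): apply shuffle-related tuning to a job
    // configuration before submission. Property names come from 0.20's
    // mapred-default.xml; the values here are illustrative only.
    public final class ShuffleTuning {
        private ShuffleTuning() {}

        public static void apply(JobConf conf) {
            // Parallel map-output fetch threads per reducer (0.20 default: 5).
            // Lowering this reduces concurrent load on a slow TaskTracker
            // HTTP (Jetty) server.
            conf.setInt("mapred.reduce.parallel.copies", 2);

            // Milliseconds before an unresponsive task is declared failed
            // (0.20 default: 600000, i.e. 10 minutes).
            conf.setInt("mapred.task.timeout", 1200000);
        }
    }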
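
For readers wondering where "java.net.SocketTimeoutException: Read timed
out" comes from: each reducer fetches map output over HTTP from the
TaskTrackers' embedded Jetty servers, and the fetch fails this way when the
server accepts the connection but then stalls mid-response. The following
is not Hadoop's shuffle code, just a plain-Java sketch of the same failure
mode; the URL argument is whatever endpoint you want to probe, and the
timeout values are arbitrary.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // TimedFetch.java -- fetch a URL with connect/read timeouts set. If the
    // server stalls longer than the read timeout, in.read() throws
    // java.net.SocketTimeoutException ("Read timed out"), the same message
    // seen in the reducer logs.
    public class TimedFetch {
        public static void main(String[] args) throws Exception {
            URL url = new URL(args[0]);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(30000); // fail fast if the port is unreachable
            conn.setReadTimeout(60000);    // throws SocketTimeoutException on a stall
            long total = 0;
            InputStream in = conn.getInputStream();
            try {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    total += n;
                }
            } finally {
                in.close();
                conn.disconnect();
            }
            System.out.println("Fetched " + total + " bytes from " + url);
        }
    }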