From: Ravi Phulari
To: core-user@hadoop.apache.org, Mithila Nagendra
Date: Wed, 15 Apr 2009 10:19:51 -0700
Subject: Re: Map-Reduce Slow Down

Looks like your NameNode is down.

Verify that the Hadoop processes are running (jps should show you all the
running Java processes). If the Hadoop processes are running, try
restarting them. I guess this problem is due to your fsimage not being
correct, so you might have to format your NameNode.

Hope this helps.

Thanks,
--
Ravi
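A minimal sketch of those checks, assuming a standard 0.18.x tarball
install run from HADOOP_HOME on the master:

    jps                    # should list NameNode and JobTracker on the
                           # master, DataNode and TaskTracker on slaves
    bin/stop-all.sh        # stop every Hadoop daemon
    bin/start-all.sh       # start them again
    # last resort only -- formatting ERASES everything stored in HDFS:
    bin/hadoop namenode -format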
On 4/15/09 10:15 AM, "Mithila Nagendra" wrote:

The log file runs into thousands of lines, with the same message displayed
every time.

On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra wrote:
> The log file hadoop-mithila-datanode-node19.log.2009-04-14 has the
> following in it:
>
> 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = node19/127.0.0.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.18.3
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18
> -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> ************************************************************/
> 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> [... identical retry messages for attempts 2 through 9 ...]
> 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at
> node18/192.168.0.18:54310 not available yet, Zzzzz...
> [... the same ten-attempt retry cycle then repeats indefinitely ...]
>
> Hmmm, I still can't figure it out.
>
> Mithila
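The loop above shows the DataNode on node19 never reaching the NameNode RPC
port on node18:54310; note also that the startup banner resolves the local
host as node19/127.0.0.1, i.e. to the loopback address. A quick sketch of
connectivity checks, assuming nc (or telnet) and the usual Linux net-tools
are installed:

    # from node19: does the NameNode port answer at all?
    nc -z -v 192.168.0.18 54310      # or: telnet 192.168.0.18 54310

    # on node18: is anything listening on 54310, and on which interface?
    netstat -tln | grep 54310

    # on both nodes: do the hostnames resolve to real interfaces rather
    # than 127.0.0.1? (see the "host = node19/127.0.0.1" banner above)
    grep -E 'node1[89]' /etc/hosts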
> On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra wrote:
>> Also, would the way the port is accessed change if all these nodes are
>> connected through a gateway, I mean in the hadoop-site.xml file? The
>> Ubuntu systems we worked with earlier didn't have a gateway.
>>
>> Mithila
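A gateway between the nodes should not by itself change the port settings;
what matters is that every node carries the same fs.default.name and that
the slaves can resolve and reach that exact host and port. A quick
consistency check, with illustrative values assumed from this thread:

    # run on the master and on each slave; the output must match everywhere
    grep -A 1 fs.default.name conf/hadoop-site.xml
    # expected on this cluster (illustrative):
    #   <name>fs.default.name</name>
    #   <value>hdfs://node18:54310</value>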
>> On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra wrote:
>>> Aaron: Which log file do I look into? There are a lot of them. Here's
>>> what the error looks like:
>>>
>>> [mithila@node19:~]$ cd hadoop
>>> [mithila@node19:~/hadoop]$ bin/hadoop dfs -ls
>>> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server:
>>> node18/192.168.0.18:54310. Already tried 0 time(s).
>>> [... identical retry messages for attempts 1 through 8 ...]
>>> 09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server:
>>> node18/192.168.0.18:54310. Already tried 9 time(s).
>>> Bad connection to FS. command aborted.
>>>
>>> Node19 is a slave and Node18 is the master.
>>>
>>> Mithila
>>>
>>> On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball wrote:
>>>> Are there any error messages in the log files on those nodes?
>>>> - Aaron
>>>>
>>>> On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra wrote:
>>>>> I've drawn a blank here! I can't figure out what's wrong with the
>>>>> ports. I can ssh between the nodes but can't access the DFS from the
>>>>> slaves; it says "Bad connection to DFS". The master seems to be fine.
>>>>> Mithila
>>>>>
>>>>> On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra wrote:
>>>>>> Yes I can..
>>>>>>
>>>>>> On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky wrote:
>>>>>>> Can you ssh between the nodes?
>>>>>>>
>>>>>>> -jim
>>>>>>>
>>>>>>> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra wrote:
>>>>>>>> Thanks Aaron.
>>>>>>>> Jim: The three clusters I set up earlier had Ubuntu running on
>>>>>>>> them, and the DFS was accessed at port 54310. The new cluster I've
>>>>>>>> set up has Red Hat Linux release 7.2 (Enigma) running on it. Now
>>>>>>>> when I try to access the DFS from one of the slaves, I get the
>>>>>>>> following response: dfs cannot be accessed. When I access the DFS
>>>>>>>> through the master there's no problem, so I feel there is a
>>>>>>>> problem with the port. Any ideas? I did check the list of slaves;
>>>>>>>> it looks fine to me.
>>>>>>>>
>>>>>>>> Mithila
>>>>>>>>
>>>>>>>> On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky wrote:
>>>>>>>>> Mithila,
>>>>>>>>>
>>>>>>>>> You said all the slaves were being utilized in the 3-node
>>>>>>>>> cluster. Which application did you run to test that, and what was
>>>>>>>>> your input size? If you tried the word count application on a
>>>>>>>>> 516 MB input file on both cluster setups, then some of your nodes
>>>>>>>>> in the 15-node cluster may not be running at all. Generally, one
>>>>>>>>> map task is assigned to each input split, and if you are running
>>>>>>>>> your cluster with the defaults, the splits are 64 MB each. I got
>>>>>>>>> confused when you said the Namenode seemed to do all the work.
>>>>>>>>> Can you check conf/slaves and make sure you put the names of all
>>>>>>>>> the task trackers there? I also suggest comparing both clusters
>>>>>>>>> with a larger input size, say at least 5 GB, to really see a
>>>>>>>>> difference.
>>>>>>>>>
>>>>>>>>> Jim
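To make Jim's arithmetic concrete (assuming the default 64 MB split size
he mentions):

    # 516 MB input / 64 MB per split = 8 full splits + 1 partial = 9
    # splits, hence at most ~9 map tasks, so at least 6 of the 15 nodes
    # get no map work at all for this job:
    echo $(( (516 + 63) / 64 ))      # prints 9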
>>>>>>>>> On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball wrote:
>>>>>>>>>> In hadoop-*-examples.jar, use "randomwriter" to generate the
>>>>>>>>>> data and "sort" to sort it.
>>>>>>>>>> - Aaron
>>>>>>>>>>
>>>>>>>>>> On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi wrote:
>>>>>>>>>>> Your data is too small, I guess, for 15 nodes, so the overhead
>>>>>>>>>>> of those nodes may be what makes your total MR jobs more time
>>>>>>>>>>> consuming. I guess you will have to try a larger data set.
>>>>>>>>>>>
>>>>>>>>>>> Pankil
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra wrote:
>>>>>>>>>>>> Aaron,
>>>>>>>>>>>>
>>>>>>>>>>>> That could be the issue; my data is just 516 MB. Wouldn't this
>>>>>>>>>>>> see a bit of a speed-up? Could you guide me to the example?
>>>>>>>>>>>> I'll run my cluster on it and see what I get. Also, for my
>>>>>>>>>>>> program I had a Java timer running to record the time taken to
>>>>>>>>>>>> complete execution. Does Hadoop have an inbuilt timer?
>>>>>>>>>>>>
>>>>>>>>>>>> Mithila
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball wrote:
>>>>>>>>>>>>> Virtually none of the examples that ship with Hadoop are
>>>>>>>>>>>>> designed to showcase its speed. Hadoop's speedup comes from
>>>>>>>>>>>>> its ability to process very large volumes of data (starting
>>>>>>>>>>>>> around, say, tens of GB per job, and going up in orders of
>>>>>>>>>>>>> magnitude from there). So if you are timing the pi calculator
>>>>>>>>>>>>> (or something like that), its results won't necessarily be
>>>>>>>>>>>>> very consistent. If a job doesn't have enough fragments of
>>>>>>>>>>>>> data to allocate one to each node, some of the nodes will
>>>>>>>>>>>>> also just go unused.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The best example for you to run is to use randomwriter to
>>>>>>>>>>>>> fill up your cluster with several GB of random data and then
>>>>>>>>>>>>> run the sort program. If that doesn't scale up performance
>>>>>>>>>>>>> from 3 nodes to 15, then you've definitely got something
>>>>>>>>>>>>> strange going on.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Aaron
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra wrote:
>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>> I recently set up a three-node Hadoop cluster and ran an
>>>>>>>>>>>>>> example on it. It was pretty fast, and all three nodes were
>>>>>>>>>>>>>> being used (I checked the log files to make sure that the
>>>>>>>>>>>>>> slaves were utilized).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now I've set up another cluster consisting of 15 nodes. I
>>>>>>>>>>>>>> ran the same example, but instead of speeding up, the
>>>>>>>>>>>>>> map-reduce task seems to take forever! The slaves are not
>>>>>>>>>>>>>> being used for some reason. This second cluster has lower
>>>>>>>>>>>>>> per-node processing power, but should that make any
>>>>>>>>>>>>>> difference? How can I ensure that the data is being mapped
>>>>>>>>>>>>>> to all the nodes? Presently, the only node that seems to be
>>>>>>>>>>>>>> doing all the work is the master node.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do 15 nodes in a cluster increase the network cost? What can
>>>>>>>>>>>>>> I do to set up the cluster to function more efficiently?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>> Mithila Nagendra
>>>>>>>>>>>>>> Arizona State University
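Tying the thread together, a sketch of the randomwriter/sort benchmark
suggested above, assuming Hadoop 0.18.3 with the examples jar in
HADOOP_HOME (the exact jar name may differ):

    # fill HDFS with random data; by default randomwriter runs several
    # map tasks per node, so every TaskTracker should get work
    bin/hadoop jar hadoop-0.18.3-examples.jar randomwriter rand-in

    # sort it, timing the run from the shell; the JobTracker web UI
    # (port 50030 by default) also shows start/finish times per job
    # and per task, which answers the inbuilt-timer question above
    time bin/hadoop jar hadoop-0.18.3-examples.jar sort rand-in rand-out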