From: Ravi Phulari
To: core-user@hadoop.apache.org, Mithila Nagendra
Date: Wed, 15 Apr 2009 10:19:51 -0700
Subject: Re: Map-Reduce Slow Down

Looks like your NameNode is down.

Verify that the Hadoop processes are running (jps should show you all the
running Java processes). If the Hadoop processes are running, try
restarting them. I guess this problem is due to your fsimage not being
correct, so you might have to format your NameNode.

Hope this helps.

Thanks,
--
Ravi
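A minimal sketch of those checks, assuming a standard 0.18.x tarball
install run from HADOOP_HOME on the master:

    jps                    # should list NameNode and JobTracker on the
                           # master, DataNode and TaskTracker on slaves
    bin/stop-all.sh        # stop every Hadoop daemon
    bin/start-all.sh       # start them again
    # last resort only -- formatting ERASES everything stored in HDFS:
    bin/hadoop namenode -format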
On 4/15/09 10:15 AM, "Mithila Nagendra" wrote:

The log file runs into thousands of lines, with the same message displayed
every time.

On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra wrote:
> The log file hadoop-mithila-datanode-node19.log.2009-04-14 has the
> following in it:
>
> 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = node19/127.0.0.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.18.3
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18
> -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> ************************************************************/
> 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> [... identical retry messages for attempts 2 through 9 ...]
> 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at
> node18/192.168.0.18:54310 not available yet, Zzzzz...
> [... the same ten-attempt retry cycle then repeats indefinitely ...]
>
> Hmmm, I still can't figure it out.
>
> Mithila
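The loop above shows the DataNode on node19 never reaching the NameNode RPC
port on node18:54310; note also that the startup banner resolves the local
host as node19/127.0.0.1, i.e. to the loopback address. A quick sketch of
connectivity checks, assuming nc (or telnet) and the usual Linux net-tools
are installed:

    # from node19: does the NameNode port answer at all?
    nc -z -v 192.168.0.18 54310      # or: telnet 192.168.0.18 54310

    # on node18: is anything listening on 54310, and on which interface?
    netstat -tln | grep 54310

    # on both nodes: do the hostnames resolve to real interfaces rather
    # than 127.0.0.1? (see the "host = node19/127.0.0.1" banner above)
    grep -E 'node1[89]' /etc/hosts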
> On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra wrote:
>> Also, would the way the port is accessed change if all these nodes are
>> connected through a gateway, I mean in the hadoop-site.xml file? The
>> Ubuntu systems we worked with earlier didn't have a gateway.
>>
>> Mithila
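A gateway between the nodes should not by itself change the port settings;
what matters is that every node carries the same fs.default.name and that
the slaves can resolve and reach that exact host and port. A quick
consistency check, with illustrative values assumed from this thread:

    # run on the master and on each slave; the output must match everywhere
    grep -A 1 fs.default.name conf/hadoop-site.xml
    # expected on this cluster (illustrative):
    #   <name>fs.default.name</name>
    #   <value>hdfs://node18:54310</value>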
>> On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra wrote:
>>> Aaron: Which log file do I look into? There are a lot of them. Here's
>>> what the error looks like:
>>>
>>> [mithila@node19:~]$ cd hadoop
>>> [mithila@node19:~/hadoop]$ bin/hadoop dfs -ls
>>> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server:
>>> node18/192.168.0.18:54310. Already tried 0 time(s).
>>> [... identical retry messages for attempts 1 through 8 ...]
>>> 09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server:
>>> node18/192.168.0.18:54310. Already tried 9 time(s).
>>> Bad connection to FS. command aborted.
>>>
>>> Node19 is a slave and Node18 is the master.
>>>
>>> Mithila
>>>
>>> On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball wrote:
>>>> Are there any error messages in the log files on those nodes?
>>>> - Aaron
>>>>
>>>> On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra wrote:
>>>>> I've drawn a blank here! I can't figure out what's wrong with the
>>>>> ports. I can ssh between the nodes but can't access the DFS from the
>>>>> slaves; it says "Bad connection to DFS". The master seems to be fine.
>>>>> Mithila
>>>>>
>>>>> On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra wrote:
>>>>>> Yes I can..
>>>>>>
>>>>>> On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky wrote:
>>>>>>> Can you ssh between the nodes?
>>>>>>>
>>>>>>> -jim
>>>>>>>
>>>>>>> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra wrote:
>>>>>>>> Thanks Aaron.
>>>>>>>> Jim: The three clusters I set up earlier had Ubuntu running on
>>>>>>>> them, and the DFS was accessed at port 54310. The new cluster I've
>>>>>>>> set up has Red Hat Linux release 7.2 (Enigma) running on it. Now
>>>>>>>> when I try to access the DFS from one of the slaves, I get the
>>>>>>>> following response: dfs cannot be accessed. When I access the DFS
>>>>>>>> through the master there's no problem, so I feel there is a
>>>>>>>> problem with the port. Any ideas? I did check the list of slaves;
>>>>>>>> it looks fine to me.
>>>>>>>>
>>>>>>>> Mithila
>>>>>>>>
>>>>>>>> On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky wrote:
>>>>>>>>> Mithila,
>>>>>>>>>
>>>>>>>>> You said all the slaves were being utilized in the 3-node
>>>>>>>>> cluster. Which application did you run to test that, and what was
>>>>>>>>> your input size? If you tried the word count application on a
>>>>>>>>> 516 MB input file on both cluster setups, then some of your nodes
>>>>>>>>> in the 15-node cluster may not be running at all. Generally, one
>>>>>>>>> map task is assigned to each input split, and if you are running
>>>>>>>>> your cluster with the defaults, the splits are 64 MB each. I got
>>>>>>>>> confused when you said the Namenode seemed to do all the work.
>>>>>>>>> Can you check conf/slaves and make sure you put the names of all
>>>>>>>>> the task trackers there? I also suggest comparing both clusters
>>>>>>>>> with a larger input size, say at least 5 GB, to really see a
>>>>>>>>> difference.
>>>>>>>>>
>>>>>>>>> Jim
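To make Jim's arithmetic concrete (assuming the default 64 MB split size
he mentions):

    # 516 MB input / 64 MB per split = 8 full splits + 1 partial = 9
    # splits, hence at most ~9 map tasks, so at least 6 of the 15 nodes
    # get no map work at all for this job:
    echo $(( (516 + 63) / 64 ))      # prints 9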
>>>>>>>>> On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball wrote:
>>>>>>>>>> In hadoop-*-examples.jar, use "randomwriter" to generate the
>>>>>>>>>> data and "sort" to sort it.
>>>>>>>>>> - Aaron
>>>>>>>>>>
>>>>>>>>>> On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi wrote:
>>>>>>>>>>> Your data is too small, I guess, for 15 nodes, so the overhead
>>>>>>>>>>> of those nodes may be what makes your total MR jobs more time
>>>>>>>>>>> consuming. I guess you will have to try a larger data set.
>>>>>>>>>>>
>>>>>>>>>>> Pankil
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra wrote:
>>>>>>>>>>>> Aaron,
>>>>>>>>>>>>
>>>>>>>>>>>> That could be the issue; my data is just 516 MB. Wouldn't this
>>>>>>>>>>>> see a bit of a speed-up? Could you guide me to the example?
>>>>>>>>>>>> I'll run my cluster on it and see what I get. Also, for my
>>>>>>>>>>>> program I had a Java timer running to record the time taken to
>>>>>>>>>>>> complete execution. Does Hadoop have an inbuilt timer?
>>>>>>>>>>>>
>>>>>>>>>>>> Mithila
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball wrote:
>>>>>>>>>>>>> Virtually none of the examples that ship with Hadoop are
>>>>>>>>>>>>> designed to showcase its speed. Hadoop's speedup comes from
>>>>>>>>>>>>> its ability to process very large volumes of data (starting
>>>>>>>>>>>>> around, say, tens of GB per job, and going up in orders of
>>>>>>>>>>>>> magnitude from there). So if you are timing the pi calculator
>>>>>>>>>>>>> (or something like that), its results won't necessarily be
>>>>>>>>>>>>> very consistent. If a job doesn't have enough fragments of
>>>>>>>>>>>>> data to allocate one to each node, some of the nodes will
>>>>>>>>>>>>> also just go unused.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The best example for you to run is to use randomwriter to
>>>>>>>>>>>>> fill up your cluster with several GB of random data and then
>>>>>>>>>>>>> run the sort program. If that doesn't scale up performance
>>>>>>>>>>>>> from 3 nodes to 15, then you've definitely got something
>>>>>>>>>>>>> strange going on.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Aaron
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra wrote:
>>>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>>> I recently set up a three-node Hadoop cluster and ran an
>>>>>>>>>>>>>> example on it. It was pretty fast, and all three nodes were
>>>>>>>>>>>>>> being used (I checked the log files to make sure that the
>>>>>>>>>>>>>> slaves were utilized).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now I've set up another cluster consisting of 15 nodes. I
>>>>>>>>>>>>>> ran the same example, but instead of speeding up, the
>>>>>>>>>>>>>> map-reduce task seems to take forever! The slaves are not
>>>>>>>>>>>>>> being used for some reason. This second cluster has lower
>>>>>>>>>>>>>> per-node processing power, but should that make any
>>>>>>>>>>>>>> difference? How can I ensure that the data is being mapped
>>>>>>>>>>>>>> to all the nodes? Presently, the only node that seems to be
>>>>>>>>>>>>>> doing all the work is the master node.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do 15 nodes in a cluster increase the network cost? What can
>>>>>>>>>>>>>> I do to set up the cluster to function more efficiently?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>> Mithila Nagendra
>>>>>>>>>>>>>> Arizona State University
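Tying the thread together, a sketch of the randomwriter/sort benchmark
suggested above, assuming Hadoop 0.18.3 with the examples jar in
HADOOP_HOME (the exact jar name may differ):

    # fill HDFS with random data; by default randomwriter runs several
    # map tasks per node, so every TaskTracker should get work
    bin/hadoop jar hadoop-0.18.3-examples.jar randomwriter rand-in

    # sort it, timing the run from the shell; the JobTracker web UI
    # (port 50030 by default) also shows start/finish times per job
    # and per task, which answers the inbuilt-timer question above
    time bin/hadoop jar hadoop-0.18.3-examples.jar sort rand-in rand-out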