hadoop-common-user mailing list archives

From Mithila Nagendra <mnage...@asu.edu>
Subject Re: Map-Reduce Slow Down
Date Wed, 15 Apr 2009 17:15:32 GMT
The log file runs to thousands of lines, with the same message repeated
every time.

On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra <mnagendr@asu.edu> wrote:

> The log file hadoop-mithila-datanode-node19.log.2009-04-14 has the
> following in it:
>
> 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = node19/127.0.0.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.18.3
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> ************************************************************/
> 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> 2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> 2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> 2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> 2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> 2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> 2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> 2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> 2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at node18/
> 192.168.0.18:54310 not available yet, Zzzzz...
> 2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> 2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> 2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> 2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> 2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> 2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> 2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> 2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> 2009-04-14 10:08:33,115 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> 2009-04-14 10:08:33,116 INFO org.apache.hadoop.ipc.RPC: Server at node18/
> 192.168.0.18:54310 not available yet, Zzzzz...
> 2009-04-14 10:08:35,135 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> 2009-04-14 10:08:36,145 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> 2009-04-14 10:08:37,155 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>
>
> Hmm, I still can't figure it out.
>
> Mithila
>
>
> On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra <mnagendr@asu.edu> wrote:
>
>> Also, would the way the port is accessed change if all these nodes are
>> connected through a gateway? I mean in the hadoop-site.xml file. The Ubuntu
>> systems we worked with earlier didn't have a gateway.
>> Mithila
>>
>> On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra <mnagendr@asu.edu> wrote:
>>
>>> Aaron: Which log file should I look into? There are a lot of them. Here's
>>> what the error looks like:
>>> [mithila@node19:~]$ cd hadoop
>>> [mithila@node19:~/hadoop]$ bin/hadoop dfs -ls
>>> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 0 time(s).
>>> 09/04/14 10:09:30 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 1 time(s).
>>> 09/04/14 10:09:31 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 2 time(s).
>>> 09/04/14 10:09:32 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 3 time(s).
>>> 09/04/14 10:09:33 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 4 time(s).
>>> 09/04/14 10:09:34 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 5 time(s).
>>> 09/04/14 10:09:35 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 6 time(s).
>>> 09/04/14 10:09:36 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 7 time(s).
>>> 09/04/14 10:09:37 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 8 time(s).
>>> 09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server: node18/
>>> 192.168.0.18:54310. Already tried 9 time(s).
>>> Bad connection to FS. command aborted.
>>>
>>> Node19 is a slave and Node18 is the master.
>>>
>>> Mithila
>>>
>>>
>>>
>>> On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball <aaron@cloudera.com> wrote:
>>>
>>>> Are there any error messages in the log files on those nodes?
>>>> - Aaron
>>>>
>>>> On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra <mnagendr@asu.edu>
>>>> wrote:
>>>>
>>>> > I've drawn a blank here! Can't figure out what's wrong with the ports. I
>>>> > can ssh between the nodes but can't access the DFS from the slaves - says
>>>> > "Bad connection to DFS". Master seems to be fine.
>>>> > Mithila
>>>> >
>>>> > On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra <mnagendr@asu.edu>
>>>> > wrote:
>>>> >
>>>> > > Yes I can..
>>>> > >
>>>> > >
>>>> > > On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky <jim.twensky@gmail.com>
>>>> > > wrote:
>>>> > >
>>>> > >> Can you ssh between the nodes?
>>>> > >>
>>>> > >> -jim
>>>> > >>
>>>> > >> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra <mnagendr@asu.edu>
>>>> > >> wrote:
>>>> > >>
>>>> > >> > Thanks Aaron.
>>>> > >> > Jim: The three clusters I set up had Ubuntu running on them, and the
>>>> > >> > DFS was accessed at port 54310. The new cluster I've set up has Red Hat
>>>> > >> > Linux release 7.2 (Enigma) running on it. Now when I try to access the
>>>> > >> > DFS from one of the slaves, I get the following response: dfs cannot be
>>>> > >> > accessed. When I access the DFS through the master there's no problem,
>>>> > >> > so I feel there's a problem with the port. Any ideas? I did check the
>>>> > >> > list of slaves; it looks fine to me.
>>>> > >> >
>>>> > >> > Mithila
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> > On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky <jim.twensky@gmail.com>
>>>> > >> > wrote:
>>>> > >> >
>>>> > >> > > Mithila,
>>>> > >> > >
>>>> > >> > > You said all the slaves were being utilized in the 3 node cluster. Which
>>>> > >> > > application did you run to test that and what was your input size? If you
>>>> > >> > > tried the word count application on a 516 MB input file on both cluster
>>>> > >> > > setups, then some of your nodes in the 15 node cluster may not be running
>>>> > >> > > at all. Generally, one map task is assigned to each input split and if you
>>>> > >> > > are running your cluster with the defaults, the splits are 64 MB each. I got
>>>> > >> > > confused when you said the Namenode seemed to do all the work. Can you
>>>> > >> > > check conf/slaves and make sure you put the names of all task trackers
>>>> > >> > > there? I also suggest comparing both clusters with a larger input size, say
>>>> > >> > > at least 5 GB, to really see a difference.
>>>> > >> > >
>>>> > >> > > Jim
>>>> > >> > >
>>>> > >> > > On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball <aaron@cloudera.com>
>>>> > >> > > wrote:
>>>> > >> > >
>>>> > >> > > > In hadoop-*-examples.jar, use "randomwriter" to generate the data and
>>>> > >> > > > "sort" to sort it.
>>>> > >> > > > - Aaron
>>>> > >> > > >
>>>> > >> > > > On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi <forpankil@gmail.com>
>>>> > >> > > > wrote:
>>>> > >> > > >
>>>> > >> > > > > Your data is too small, I guess, for a 15 node cluster. So it might be
>>>> > >> > > > > the overhead time of these nodes making your total MR job more time
>>>> > >> > > > > consuming. I guess you will have to try with a larger set of data.
>>>> > >> > > > >
>>>> > >> > > > > Pankil
>>>> > >> > > > > On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra <mnagendr@asu.edu>
>>>> > >> > > > > wrote:
>>>> > >> > > > >
>>>> > >> > > > > > Aaron
>>>> > >> > > > > >
>>>> > >> > > > > > That could be the issue, my data is just 516MB - wouldn't this see a bit
>>>> > >> > > > > > of speed up?
>>>> > >> > > > > > Could you guide me to the example? I'll run my cluster on it and see what
>>>> > >> > > > > > I get. Also for my program I had a java timer running to record the time
>>>> > >> > > > > > taken to complete execution. Does Hadoop have an inbuilt timer?
>>>> > >> > > > > >
>>>> > >> > > > > > Mithila
>>>> > >> > > > > >
>>>> > >> > > > > > On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball <aaron@cloudera.com>
>>>> > >> > > > > > wrote:
>>>> > >> > > > > >
>>>> > >> > > > > > > Virtually none of the examples that ship with Hadoop are designed to
>>>> > >> > > > > > > showcase its speed. Hadoop's speedup comes from its ability to process
>>>> > >> > > > > > > very large volumes of data (starting around, say, tens of GB per job, and
>>>> > >> > > > > > > going up in orders of magnitude from there). So if you are timing the pi
>>>> > >> > > > > > > calculator (or something like that), its results won't necessarily be
>>>> > >> > > > > > > very consistent. If a job doesn't have enough fragments of data to
>>>> > >> > > > > > > allocate one to each node, some of the nodes will also just go unused.
>>>> > >> > > > > > >
>>>> > >> > > > > > > The best example for you to run is to use randomwriter to fill up your
>>>> > >> > > > > > > cluster with several GB of random data and then run the sort program. If
>>>> > >> > > > > > > that doesn't scale up performance from 3 nodes to 15, then you've
>>>> > >> > > > > > > definitely got something strange going on.
>>>> > >> > > > > > >
>>>> > >> > > > > > > - Aaron
>>>> > >> > > > > > >
>>>> > >> > > > > > >
>>>> > >> > > > > > > On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra <mnagendr@asu.edu>
>>>> > >> > > > > > > wrote:
>>>> > >> > > > > > >
>>>> > >> > > > > > > > Hey all
>>>> > >> > > > > > > > I recently set up a three node hadoop cluster and ran an example on it.
>>>> > >> > > > > > > > It was pretty fast, and all the three nodes were being used (I checked
>>>> > >> > > > > > > > the log files to make sure that the slaves are utilized).
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > Now I've set up another cluster consisting of 15 nodes. I ran the same
>>>> > >> > > > > > > > example, but instead of speeding up, the map-reduce task seems to take
>>>> > >> > > > > > > > forever! The slaves are not being used for some reason. This second
>>>> > >> > > > > > > > cluster has a lower per-node processing power, but should that make any
>>>> > >> > > > > > > > difference?
>>>> > >> > > > > > > > How can I ensure that the data is being mapped to all the nodes?
>>>> > >> > > > > > > > Presently, the only node that seems to be doing all the work is the
>>>> > >> > > > > > > > Master node.
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > Does having 15 nodes in a cluster increase the network cost? What can I
>>>> > >> > > > > > > > do to set up the cluster to function more efficiently?
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > Thanks!
>>>> > >> > > > > > > > Mithila Nagendra
>>>> > >> > > > > > > > Arizona State University
>>>> > >> > > > > > > >
>>>> > >> > > > > > >
>>>> > >> > > > > >
>>>> > >> > > > >
>>>> > >> > > >
>>>> > >> > >
>>>> > >> >
>>>> > >>
>>>> > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>
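
The retry loop in the DataNode log above only says that nothing answered on node18:54310 when node19 tried to connect. A minimal sketch for checking that directly from a node, assuming Python is available there; the helper name can_connect is hypothetical and not part of Hadoop:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling can_connect("node18", 54310) on node19 and again on node18 itself would show whether the port is closed everywhere or only unreachable from the slaves. It does not say why; a NameNode listening only on a loopback address or a firewall between the nodes are two possibilities that might be worth ruling out.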

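Jim's point about input splits can be made concrete with a little arithmetic. A rough sketch, assuming one map task per split, the 64 MB default split size he mentions, and a single input file; the function names are hypothetical, and the real split count depends on the InputFormat and on how the input is laid out (Hadoop may also fold a small remainder into the last split, so the true number can be one lower):

```python
import math


def estimate_map_tasks(input_mb, split_mb=64):
    """Ceiling-based estimate of map tasks for one input of input_mb megabytes."""
    return math.ceil(input_mb / split_mb)


def idle_nodes(input_mb, nodes, split_mb=64):
    """Lower bound on the number of nodes that receive no map task at all."""
    return max(0, nodes - estimate_map_tasks(input_mb, split_mb))


print(estimate_map_tasks(516))        # 9
print(idle_nodes(516, nodes=15))      # 6
print(estimate_map_tasks(5 * 1024))   # 80
```

Under these assumptions a 516 MB input yields about 9 map tasks, so at least 6 of 15 nodes would have no map work, while the 5 GB input Jim suggests yields about 80 and gives every node something to do.
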