hadoop-common-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Map-Reduce Slow Down
Date Wed, 15 Apr 2009 17:59:02 GMT
Hi,

I wrote a blog post a while back about connecting nodes via a gateway. See
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/

This assumes that the client is outside the gateway and all
datanodes/namenode are inside, but the same principles apply. You'll just
need to set up ssh tunnels from every datanode to the namenode.
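A minimal sketch of the kind of tunnel each datanode would need, assuming (as in this thread) a namenode on node18 listening on port 54310, and a hypothetical gateway host named gateway; the snippet only assembles and prints the forwarding command so it can be inspected before running:

```shell
# Hypothetical hosts: namenode "node18" (RPC port 54310), reachable via "gateway".
NAMENODE_HOST=node18
NAMENODE_PORT=54310
GATEWAY=gateway

# -N: no remote command; -L: forward local port 54310 to node18:54310.
TUNNEL_CMD="ssh -N -L ${NAMENODE_PORT}:${NAMENODE_HOST}:${NAMENODE_PORT} ${GATEWAY}"
echo "${TUNNEL_CMD}"
```

With such a tunnel up, the tunneling host can reach the namenode's RPC port via localhost:54310.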

- Aaron

On Wed, Apr 15, 2009 at 10:19 AM, Ravi Phulari <rphulari@yahoo-inc.com> wrote:

> Looks like your NameNode is down.
> Verify that the Hadoop processes are running (jps should list all running
> Java processes).
> If they are running, try restarting them.
> This problem may be caused by your fsimage not being correct.
> You might have to format your namenode.
> Hope this helps.
>
> Thanks,
> --
> Ravi
>
>
> On 4/15/09 10:15 AM, "Mithila Nagendra" <mnagendr@asu.edu> wrote:
>
> The log file runs into thousands of lines, with the same message repeated
> every time.
>
> On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra <mnagendr@asu.edu>
> wrote:
>
> > The log file hadoop-mithila-datanode-node19.log.2009-04-14 has the
> > following in it:
> >
> > 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
> > /************************************************************
> > STARTUP_MSG: Starting DataNode
> > STARTUP_MSG:   host = node19/127.0.0.1
> > STARTUP_MSG:   args = []
> > STARTUP_MSG:   version = 0.18.3
> > STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> > ************************************************************/
> > 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> > 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> > 2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> > 2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> > 2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> > 2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> > 2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> > 2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> > 2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> > 2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> > 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at node18/192.168.0.18:54310 not available yet, Zzzzz...
> > 2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> > 2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> > 2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> > 2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> > 2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> > 2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> > 2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> > 2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> > 2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> > 2009-04-14 10:08:33,115 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> > 2009-04-14 10:08:33,116 INFO org.apache.hadoop.ipc.RPC: Server at node18/192.168.0.18:54310 not available yet, Zzzzz...
> > 2009-04-14 10:08:35,135 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> > 2009-04-14 10:08:36,145 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> > 2009-04-14 10:08:37,155 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> >
> >
> > Hmm, I still can't figure it out.
> >
> > Mithila
> >
> >
> > On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra <mnagendr@asu.edu> wrote:
> >
> >> Also, would the way the port is accessed change if all these nodes are
> >> connected through a gateway? I mean in the hadoop-site.xml file? The Ubuntu
> >> systems we worked with earlier didn't have a gateway.
> >> Mithila
> >>
> >> On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra <mnagendr@asu.edu> wrote:
> >>
> >>> Aaron: Which log file do I look into - there are a lot of them. Here's
> >>> what the error looks like:
> >>> [mithila@node19:~]$ cd hadoop
> >>> [mithila@node19:~/hadoop]$ bin/hadoop dfs -ls
> >>> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> >>> 09/04/14 10:09:30 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> >>> 09/04/14 10:09:31 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> >>> 09/04/14 10:09:32 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> >>> 09/04/14 10:09:33 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> >>> 09/04/14 10:09:34 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> >>> 09/04/14 10:09:35 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> >>> 09/04/14 10:09:36 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> >>> 09/04/14 10:09:37 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> >>> 09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> >>> Bad connection to FS. command aborted.
> >>>
> >>> Node19 is a slave and Node18 is the master.
> >>>
> >>> Mithila
> >>>
> >>>
> >>>
> >>> On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball <aaron@cloudera.com> wrote:
> >>>
> >>>> Are there any error messages in the log files on those nodes?
> >>>> - Aaron
> >>>>
> >>>> On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra <mnagendr@asu.edu>
> >>>> wrote:
> >>>>
> >>>> > I've drawn a blank here! I can't figure out what's wrong with the ports. I
> >>>> > can ssh between the nodes but can't access the DFS from the slaves - it says
> >>>> > "Bad connection to DFS". The master seems to be fine.
> >>>> > Mithila
> >>>> >
> >>>> > On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra <mnagendr@asu.edu> wrote:
> >>>> >
> >>>> > > Yes, I can.
> >>>> > >
> >>>> > >
> >>>> > > On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky <jim.twensky@gmail.com> wrote:
> >>>> > >
> >>>> > >> Can you ssh between the nodes?
> >>>> > >>
> >>>> > >> -jim
> >>>> > >>
> >>>> > >> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra <mnagendr@asu.edu> wrote:
> >>>> > >>
> >>>> > >> > Thanks Aaron.
> >>>> > >> > Jim: The three clusters I set up had Ubuntu running on them, and the DFS was
> >>>> > >> > accessed at port 54310. The new cluster I've set up has Red Hat Linux release
> >>>> > >> > 7.2 (Enigma) running on it. Now when I try to access the DFS from one of the
> >>>> > >> > slaves I get the following response: dfs cannot be accessed. When I access
> >>>> > >> > the DFS through the master there's no problem, so I feel there's a problem
> >>>> > >> > with the port. Any ideas? I did check the list of slaves; it looks
> >>>> > >> > fine to me.
> >>>> > >> >
> >>>> > >> > Mithila
> >>>> > >> >
> >>>> > >> >
> >>>> > >> >
> >>>> > >> >
> >>>> > >> > On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky <jim.twensky@gmail.com> wrote:
> >>>> > >> >
> >>>> > >> > > Mithila,
> >>>> > >> > >
> >>>> > >> > > You said all the slaves were being utilized in the 3-node cluster. Which
> >>>> > >> > > application did you run to test that, and what was your input size? If you
> >>>> > >> > > tried the word count application on a 516 MB input file on both cluster
> >>>> > >> > > setups, then some of your nodes in the 15-node cluster may not be running
> >>>> > >> > > at all. Generally, one map job is assigned to each input split, and if you
> >>>> > >> > > are running your cluster with the defaults, the splits are 64 MB each. I
> >>>> > >> > > got confused when you said the Namenode seemed to do all the work. Can you
> >>>> > >> > > check conf/slaves and make sure you put the names of all task trackers
> >>>> > >> > > there? I also suggest comparing both clusters with a larger input size,
> >>>> > >> > > say at least 5 GB, to really see a difference.
> >>>> > >> > >
> >>>> > >> > > Jim
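Jim's split arithmetic can be sketched quickly; this assumes the default 64 MB split size he cites (an approximation, since real splits follow HDFS block boundaries):

```shell
# 516 MB input with the default 64 MB split size: ceiling division gives the
# approximate number of map tasks the job can hand out.
INPUT_MB=516
SPLIT_MB=64
NUM_SPLITS=$(( (INPUT_MB + SPLIT_MB - 1) / SPLIT_MB ))
echo "${NUM_SPLITS}"   # 9 splits -> at most 9 of the 15 nodes get a map task
```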
> >>>> > >> > >
> >>>> > >> > > On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball <aaron@cloudera.com> wrote:
> >>>> > >> > >
> >>>> > >> > > > in hadoop-*-examples.jar, use "randomwriter" to generate the data and
> >>>> > >> > > > "sort" to sort it.
> >>>> > >> > > > - Aaron
> >>>> > >> > > >
> >>>> > >> > > > On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi <forpankil@gmail.com> wrote:
> >>>> > >> > > >
> >>>> > >> > > > > Your data is too small for 15 nodes, I guess, so the overhead of these
> >>>> > >> > > > > nodes might be what makes your total MR jobs more time-consuming. I
> >>>> > >> > > > > guess you will have to try with a larger set of data.
> >>>> > >> > > > >
> >>>> > >> > > > > Pankil
> >>>> > >> > > > > On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra <mnagendr@asu.edu> wrote:
> >>>> > >> > > > >
> >>>> > >> > > > > > Aaron
> >>>> > >> > > > > >
> >>>> > >> > > > > > That could be the issue - my data is just 516 MB. Wouldn't this see a
> >>>> > >> > > > > > bit of speedup? Could you guide me to the example? I'll run my cluster
> >>>> > >> > > > > > on it and see what I get. Also, for my program I had a Java timer
> >>>> > >> > > > > > running to record the time taken to complete execution. Does Hadoop
> >>>> > >> > > > > > have an inbuilt timer?
> >>>> > >> > > > > >
> >>>> > >> > > > > > Mithila
> >>>> > >> > > > > >
> >>>> > >> > > > > > On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball <aaron@cloudera.com> wrote:
> >>>> > >> > > > > >
> >>>> > >> > > > > > > Virtually none of the examples that ship with Hadoop are designed to
> >>>> > >> > > > > > > showcase its speed. Hadoop's speedup comes from its ability to
> >>>> > >> > > > > > > process very large volumes of data (starting around, say, tens of GB
> >>>> > >> > > > > > > per job, and going up in orders of magnitude from there). So if you
> >>>> > >> > > > > > > are timing the pi calculator (or something like that), its results
> >>>> > >> > > > > > > won't necessarily be very consistent. If a job doesn't have enough
> >>>> > >> > > > > > > fragments of data to allocate one per node, some of the nodes will
> >>>> > >> > > > > > > also just go unused.
> >>>> > >> > > > > > >
> >>>> > >> > > > > > > The best example for you to run is to use randomwriter to fill up
> >>>> > >> > > > > > > your cluster with several GB of random data and then run the sort
> >>>> > >> > > > > > > program. If that doesn't scale up performance from 3 nodes to 15,
> >>>> > >> > > > > > > then you've definitely got something strange going on.
> >>>> > >> > > > > > >
> >>>> > >> > > > > > > - Aaron
> >>>> > >> > > > > > >
> >>>> > >> > > > > > >
> >>>> > >> > > > > > > On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra <mnagendr@asu.edu> wrote:
> >>>> > >> > > > > > >
> >>>> > >> > > > > > > > Hey all
> >>>> > >> > > > > > > > I recently set up a three-node Hadoop cluster and ran an example on
> >>>> > >> > > > > > > > it. It was pretty fast, and all three nodes were being used (I
> >>>> > >> > > > > > > > checked the log files to make sure that the slaves were utilized).
> >>>> > >> > > > > > > >
> >>>> > >> > > > > > > > Now I've set up another cluster consisting of 15 nodes. I ran the
> >>>> > >> > > > > > > > same example, but instead of speeding up, the map-reduce task seems
> >>>> > >> > > > > > > > to take forever! The slaves are not being used for some reason.
> >>>> > >> > > > > > > > This second cluster has a lower per-node processing power, but
> >>>> > >> > > > > > > > should that make any difference?
> >>>> > >> > > > > > > > How can I ensure that the data is being mapped to all the nodes?
> >>>> > >> > > > > > > > Presently, the only node that seems to be doing all the work is the
> >>>> > >> > > > > > > > Master node.
> >>>> > >> > > > > > > >
> >>>> > >> > > > > > > > Do 15 nodes in a cluster increase the network cost? What can I do
> >>>> > >> > > > > > > > to set up the cluster to function more efficiently?
> >>>> > >> > > > > > > >
> >>>> > >> > > > > > > > Thanks!
> >>>> > >> > > > > > > > Mithila Nagendra
> >>>> > >> > > > > > > > Arizona State University
> >>>> > >> > > > > > > >
> >>>> > >> > > > > > >
> >>>> > >> > > > > >
> >>>> > >> > > > >
> >>>> > >> > > >
> >>>> > >> > >
> >>>> > >> >
> >>>> > >>
> >>>> > >
> >>>> > >
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
>
>
>
