hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Newbie questions on Hadoop local directories?
Date Sun, 05 Apr 2009 17:50:56 GMT
On Sun, Apr 5, 2009 at 1:14 AM, Foss User <fossist@gmail.com> wrote:

> I am trying to learn Hadoop and a lot of questions come to my mind
> when I try to learn it. So, I will be asking a few questions here from
> time to time until I feel completely comfortable with it. Here are
> some questions now:
> 1. Is it true that Hadoop should be installed at the same location on
> all Linux machines? As I understand it, it needs to be installed at
> the same location on all nodes only if I am going to use
> bin/start-dfs.sh and bin/start-mapred.sh to start the data nodes and
> task trackers on all slaves. Otherwise, it is not required. Am I
> correct?

That's correct. To use those scripts, the "hadoop" script needs to be in the
same location on every machine. The different machines could theoretically
have different hadoop-site.xml files, though, which point dfs.data.dir to
different locations. This makes management a bit trickier, but is useful if
you have different disk setups on different machines.
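
To make the per-machine configuration idea concrete, here is a rough sketch that renders a hadoop-site.xml body for each host. It uses dfs.data.dir, the datanode storage property, as the example setting. The host names and directory paths are made up for illustration, and the helper names are hypothetical, not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of property name -> value as Hadoop site XML text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical hosts and disk layouts (assumptions, not from the thread).
PER_HOST_DATA_DIRS = {
    "slave1": "/disk1/dfs/data,/disk2/dfs/data",  # two data disks
    "slave2": "/data/dfs",                        # one data disk
}


def site_xml_for(host):
    """Build the site XML for one host from its own disk layout."""
    return render_site_xml({"dfs.data.dir": PER_HOST_DATA_DIRS[host]})


if __name__ == "__main__":
    for host in PER_HOST_DATA_DIRS:
        print(host, site_xml_for(host))
```

Each machine would get its own generated file, which is exactly the extra management burden mentioned above.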

> 2. Say a slave goes down (due to network problems or a power cut) while
> a word count job is running. When it comes up again, what do I need
> to do? Are bin/hadoop-daemon.sh start datanode and
> bin/hadoop-daemon.sh start tasktracker enough for recovery? Do I
> have to delete any /tmp/hadoop-hadoop directories before starting? Is
> it guaranteed that on starting, any corrupt files in the tmp directory
> would be discarded and everything would be restored to normal?

Yes - just starting the daemons should be enough. They'll clean up their
temporary files on their own.
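
As a small illustration, the two start commands from the question can be wrapped in a helper like the one below. This is only a sketch: the install path /opt/hadoop and the function names are assumptions, and the default dry-run mode just prints the commands instead of running them.

```python
import subprocess

# The two daemons named in the question.
DAEMONS = ("datanode", "tasktracker")


def restart_commands(hadoop_home):
    """Return the hadoop-daemon.sh invocations for a recovered slave."""
    script = f"{hadoop_home}/bin/hadoop-daemon.sh"
    return [[script, "start", daemon] for daemon in DAEMONS]


def restart_slave(hadoop_home, dry_run=True):
    """Print the commands (dry run) or actually execute them."""
    for cmd in restart_commands(hadoop_home):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # /opt/hadoop is a hypothetical install location.
    restart_slave("/opt/hadoop")
```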

> 3. Say I have 1 master and 4 slaves and I start datanode on 2 slaves
> and tasktracker on the other two. I put files in the HDFS. This means
> that the files would be stored on the first two datanodes. Then I run
> a word count job. This means that the word count tasks would run on the
> two task trackers. How would the two task trackers now get the files
> to do the word counting? In the documentation I was reading that the
> jobs are run on those nodes which have the data. But in this setup,
> the data nodes and task trackers are separate. So, how will the word
> count job do its work?

Hadoop will *try* to schedule tasks with data locality in mind, but if that's
impossible, it will read data off of remote nodes. Even when a task is being
run data-local, it uses the same TCP-based protocol to get data off the
datanode (this is something that is currently being worked on). Data locality
is an optimization to avoid network IO, but it is not necessary.
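
The "prefer local, fall back to remote" behavior can be sketched as below. This is a toy model and not Hadoop's actual scheduler code; the function name and the slave host names are hypothetical.

```python
def pick_tasktracker(block_hosts, free_trackers):
    """Pick a tasktracker for a task that reads one block.

    Prefers a tracker on a host that stores the block (data-local).
    If none exists, falls back to any free tracker, which then reads
    the block from a remote datanode over the network.
    Returns (tracker_host, is_data_local).
    """
    if not free_trackers:
        raise ValueError("no free tasktrackers")
    for tracker in free_trackers:
        if tracker in block_hosts:
            return tracker, True
    return free_trackers[0], False


if __name__ == "__main__":
    # The setup from the question: datanodes and tasktrackers on
    # disjoint hosts, so every task ends up reading remotely.
    print(pick_tasktracker({"slave1", "slave2"}, ["slave3", "slave4"]))
```

In the split setup from the question the job still works. It just never gets the locality benefit.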

FYI, you shouldn't run with fewer than 3 datanodes with the default
configuration. This may be the source of some of your problems in other
messages you've sent recently. The default value for dfs.replication in
hadoop-default.xml is 3, meaning that it will try to place blocks on at
least 3 machines. If there are only 2 machines up, all of your blocks by
definition will be under-replicated, and your cluster will be somewhat
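
The arithmetic behind this can be sketched as follows. It is a simplification that assumes each datanode holds at most one replica of a given block; the function name is hypothetical and the default of 3 is the dfs.replication value mentioned above.

```python
def replication_status(live_datanodes, dfs_replication=3):
    """Compare the replicas a block can get with the configured target."""
    # One replica per datanode at most, so live nodes cap the count.
    achievable = min(live_datanodes, dfs_replication)
    return {
        "replicas_per_block": achievable,
        "under_replicated": achievable < dfs_replication,
    }


if __name__ == "__main__":
    # Two datanodes with the default setting: every block falls short.
    print(replication_status(2))
    print(replication_status(3))
```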

