hadoop-common-user mailing list archives

From "Arv Mistry" <...@kindsight.net>
Subject RE: Multiple DataNodes on a single machine
Date Thu, 16 Sep 2010 13:55:58 GMT
Thanks for the responses. I especially appreciate the details, Matthew!

Just for the record, I appreciate that having multiple DataNodes on a
single machine gives up the advantages of spreading them across
machines and racks. I intend to go to that model as we grow.

Cheers Arv

-----Original Message-----
From: Matthew Foley [mailto:mattf@yahoo-inc.com] 
Sent: September 15, 2010 8:45 PM
To: common-user@hadoop.apache.org
Cc: Matthew Foley
Subject: Re: Multiple DataNodes on a single machine

Hello Arv,
It is possible to run multiple datanodes on a single machine, and this
can be useful for small-scale test scenarios.  Also you mentioned in
your previous message that you have a Hadoop implementation with only
one physical datanode server and want to replicate within it, between
spindles.  This also makes sense, and will work.  Of course, if you have
two datanodes running you will get only order-2 replication, not
order-3, even if the replication has been set to 3.
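
As a rough illustration of the replica count described above: HDFS
keeps at most one replica of a block on any given datanode, so the
achievable count is the smaller of the configured replication factor
and the number of live datanodes.  The function name below is made up
for this sketch and is not part of Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop command): replicas actually achievable
# for a requested replication factor and a number of live datanodes.
effective_replicas() {
    requested=$1
    datanodes=$2
    if [ "$datanodes" -lt "$requested" ]; then
        echo "$datanodes"
    else
        echo "$requested"
    fi
}

effective_replicas 3 2   # prints 2: two datanodes cap replication at 2
```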

I will describe the config in a moment, but I would first like to point
out that in clusters with even a few datanode servers, one is better off
with cross-server replication.  Without cross-server replication, losing
the System disk will make ALL data volumes unavailable.  And of course,
multiple datanodes running on one server will compete for cores, NICs,
bus, and memory access, even if not for spindles.

A previous responder suggested running two namenodes also, but it wasn't
clear whether he meant two primaries or one primary and one
secondary/checkpoint nameserver.  The latter is fine, but running two
primary namenodes is definitely not the thing to do!

Anyway, here's how you set it up.  I have done this recently with
v0.21.0, with two datanode processes in a single box (along with the
namenode sharing the same box), and it did replicate correctly between
the two.  I haven't tried it with more than two datanodes, and I don't
know what the impact on process efficiency would be, but that would
probably work too.

1. In your HADOOP_HOME directory, copy the "conf" directory to, say, "conf2".

2. In the conf2 directory, edit as follows:

  a) In hadoop-env.sh, provide a unique, non-default HADOOP_IDENT_STRING,
e.g. ${USER}_02
  b) In hdfs-site.xml, change dfs.data.dir to show the desired
targets/volumes for datanode#2, and of course make sure the
corresponding target directories exist.  Also remove these targets from
the dfs.data.dir target list for datanode#1 in conf/hdfs-site.xml.
  c) In hdfs-site.xml, set the four following "address:port" strings to
something non-conflicting with the other datanode and other processes
running on this box:
    - dfs.datanode.address  (default
    - dfs.datanode.ipc.address  (default
    - dfs.datanode.http.address  (default
    - dfs.datanode.https.address  (default
Note: the defaults above are what datanode#1 is probably running on.  I
added 2 to each port number for datanode#2 and it seemed to work okay.
You might also wish to note the default ports associated with the
namenode and job/task tracker processes, in case they are running on the
same box:
    - fs.default.name
    - dfs.http.address
    - dfs.https.address
    - dfs.secondary.http.address
    - mapred.job.tracker.http.address
    - mapred.task.tracker.report.address
    - mapred.task.tracker.http.address
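
Steps 1, 2a and 2c can be sketched in shell roughly as follows.  This
is only a sketch under stated assumptions: both function names are
invented here, and the base ports 50010, 50020, 50075 and 50475 are
assumed to be the stock datanode defaults for this Hadoop generation;
check the hdfs-default.xml shipped with your release before relying on
them.  The second function only prints property elements, which you
would paste into conf2/hdfs-site.xml by hand.

```shell
#!/bin/sh
# Hypothetical helper: steps 1 and 2a for a given HADOOP_HOME path.
make_second_conf() {
    hadoop_home=$1
    src="$hadoop_home/conf"
    dst="$hadoop_home/conf2"
    [ -d "$src" ] || { echo "no conf directory at $src" >&2; return 1; }
    # Refuse to reuse an existing conf2: cp -r would nest conf inside it.
    [ ! -e "$dst" ] || { echo "$dst already exists" >&2; return 1; }
    cp -r "$src" "$dst" || return 1
    # Single quotes keep ${USER} literal, so it is expanded later,
    # when hadoop-env.sh is sourced by the Hadoop scripts.
    echo 'export HADOOP_IDENT_STRING=${USER}_02' >> "$dst/hadoop-env.sh"
}

# Hypothetical helper: step 2c.  Prints the four datanode address
# properties with each ASSUMED default port shifted by the given offset.
print_datanode_ports() {
    offset=$1
    for entry in dfs.datanode.address:50010 \
                 dfs.datanode.ipc.address:50020 \
                 dfs.datanode.http.address:50075 \
                 dfs.datanode.https.address:50475; do
        name=${entry%%:*}
        port=${entry##*:}
        printf '<property><name>%s</name><value>0.0.0.0:%d</value></property>\n' \
            "$name" "$((port + offset))"
    done
}
```

For example, running make_second_conf "$HADOOP_HOME" and then
print_datanode_ports 2 would create conf2 and print the four
properties with 2 added to each assumed port, matching the offset
mentioned in the note above.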

3. At this point, launching with:
    bin/hdfs --config $HADOOP_HOME/conf2 datanode
will work.  To make it convenient to launch as a service, you can add a
couple of lines to the end of the bin/start-dfs.sh script, such as:
    "$HADOOP_COMMON_HOME"/bin/hadoop-daemons.sh --config \
        $HADOOP_CONF_DIR2 --script "$bin"/hdfs start datanode $dataStartOpt
(HADOOP_CONF_DIR2 is not a standard Hadoop variable; set it to the
conf2 directory before this line.)
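
To try the step 3 launch without editing start-dfs.sh, a small wrapper
along these lines may help.  The function name and the DRY_RUN switch
are invented for this sketch; only the "bin/hdfs --config ... datanode"
invocation comes from the instructions above.

```shell
#!/bin/sh
# Hypothetical wrapper: launch datanode#2 from the conf2 directory.
# With DRY_RUN=1 it only prints the command instead of running it.
start_second_datanode() {
    hadoop_home=$1
    conf_dir="$hadoop_home/conf2"
    [ -d "$conf_dir" ] || { echo "missing $conf_dir" >&2; return 1; }
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$hadoop_home/bin/hdfs --config $conf_dir datanode"
    else
        "$hadoop_home/bin/hdfs" --config "$conf_dir" datanode
    fi
}
```

Setting DRY_RUN=1 before calling start_second_datanode "$HADOOP_HOME"
shows the exact command line first, which is a cheap way to confirm
that the second config directory is the one being picked up.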

Hope this helps,

On Sep 15, 2010, at 8:50 AM, Arv Mistry wrote:


Is it possible to run multiple data nodes on a single machine? I
currently have a machine with multiple disks and enough disk capacity
for replication across them. I don't need redundancy at the machine
level but would like to be able to handle a single disk failure.

So I was thinking that if I could run multiple DataNodes on a single
machine, each assigned a separate disk, that would give me the
protection I need against disk failure.

Can anyone give me any insight into how I would set up multiple
DataNodes to run on a single machine? Thanks in advance,

Cheers Arv
