hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kartashov, Andy" <Andy.Kartas...@mpac.ca>
Subject Hadoop - cluster set-up (for DUMMIES)... or how I did it
Date Fri, 02 Nov 2012 16:35:35 GMT
Hello Hadoopers,

After weeks of struggle, numerous error debugging and the like I finally managed to set-up
a fully distributed cluster. I decided to share my experience with the new comers.
 In case the experts on here disagree with some of the facts mentioned here-in feel free to
correct or add your comments.

Example Cluster Topology:
Node 1 – NameNode+JobTracker
Node 2 – SecondaryNameNode
Node 3, 4, .., N – DataNodes 1,2,..N+TaskTrackers 1,2,..N

Configuration set-up after you installed Hadoop:

Firstly, you will need to find every host address of your respective Node by running:
$hostname –f

Your /etc/hadoop/ folder contains subfolders of your configuration files.  Your installation
will create a default folder conf.empty. Copy it to, say conf.cluster and make sure your soft
link conf-> points to conf.cluster

You can see what it points now to by running:
$ alternatives --display hadoop-conf

Make a new link and set it to point to conf.cluster:
$ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cluster
$ sudo alternatives --set hadoop-conf /etc/hadoop/conf.cluster
Run the display again to check proper configuration
$ alternatives --display hadoop-conf

Let’s go inside conf.cluster
$cd conf.cluster/

As a minimum, we will need to modify the following files:
1.      core-site.xml
    <value>hdfs://<host-name>/:8020/</value> # it is the host-name of your
NameNode -Node1 which you found with “hostname –f” above

2.      mapred-site.xml
    <!--<value><host-name>:8021</value> --> # it is host-name of your
NameNode – Node 1  as well, since we intend to run NameNode and JobTracker on the same machine

3.      masters # if this file doesn’t exist yet, create it and add one line:
<host-name> # it is the host-name of your Node2 – running SecondaryNameNode

4.      slaves # if this file doesn’t exist yet, create it and add your host-names ( one
per line):
<host-name> # it is the host-name of your Node3 – running DataNode1
<host-name> # it is the host-name of your Node4 – running DataNode2
<host-name> # it is the host-name of your NodeN – running DataNodeN

5.      If you are not comfortable touching hdfs-site.xml, no problem, after you format your
NameNode, it will create dfs/name dfs/data etc. folder structure in your local Linux default
/tmp/hadoop-hdfs/directory. You could later change this to a different folder by specifying
hdfs-site.xml  but please learn on the file structure/permissions/owners of those directories
/dfs/data dfs/name dfs/namesecondary etc that were created for you by default first.

Let’s format HDFS namespace: (note we format it as hdfs user)
$ sudo –u hdfs hadoop  namenode –format
NOTE – that you only run this command ONCE on the NameNode only!

I only added the following property to my hdfs-site.xml on the NameNode- Node1 for the SecondaryNameNode
to use:

  <value>namenode.host.address:50070</value>   # I change this to
for EC2 environment
    Needed for running SNN
    The address and the base port on which the dfs NameNode Web UI will listen.
    If the port is 0, the server will start on a free port.
</property>other SNN properties for hdfs-site.xml

6.      Copy you /conf.cluster/folder to every Node in your cluster: Node2 (SNN) , Node3,4,..N
(DNs+TTs). Make sure your conf soft link points to tis directory (see above).

7.              Now we ready to start daemons:

        Everytime you start a respective Daemon, a log report is written.  This is the FIRST
place to look for potential problems.
Unless you change the property in hadoop-env.sh, found in your /conf/conf.cluster/ directory,
namely “#export HADOOP_LOG_DIR=/foor/bar/whatever”   the default logs are written on each
respective Node to:
NameNode, DataNode, SecondaryNameNode – “/var/log/hadoop-hdfs/” directory
JobTracker,TaskTracker- “/var/log/hadoop-mapreduce” or “/var/log/hadoop-0.20-mapreduce/”
or else, depending on the version of your MR. Respective Daemon will have a respective filename
ending with .log

                I came across a lot of errors playing with this, as follows:
a.      Error: connection refused
This is normally caused by your firewall. Try running “sudo /etc/init.d/iptables status”.
 I bet it is running. Solution: either add allowed ports or temporarily turn off iptables
by running “sudo service iptables stop”
Try to restart your Daemon (that is refused connection) and check your respective /var/log/….
Datanode or TaskTracker or else .log file again.
This solved my problems with connections. You can test connection by running  “telnet <ip-address>
<port>” of the Node you are trying to connect to.
b.      Binding exception.
This happens when you try to start a Daemon on the machine that is not supposed to run this
Daemon. For example,  trying to start JobTracker on a slave machine.  This is a given.  JobTracker
is already running on your MasterNode -  Node1 hence the binding Exception.
c.      Java heap size or Java Child exception were thrown when I ran too small of an instance
on EC2. Increasing it from tiny to small or from small to medium, solved the issue.
d.      DataNode running on slave throws an Exception about DataNode id –mismatch. This
happened when I tried to duplicate an instance on EC2, and as a result ended up with two different
DataNodes with the same id. Deleting /tmp/hadoop-hdfs/dfs/data directory on the replicated
Instance and restarting dataNode Daemon solved this issue.
Now, that you fixed your above errors and restarted respective Daemons your ..log files should
be clean of any errors.

Let’s now check that all of our DataNodes1,2-N (Nodes3,4…N) are registered with the Master
Namenode - Node1.
“$hadoop dfsadmin –printTopology”
Should display all your host-addresses you mentioned in the /conf.cluster/slaves file.

8.      Let’s create some structure inside hdfs:
 Very IMPORTANT to Create the HDFS /tmp Directory. Create it AFTER HDFS is up and running
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

 Create MapReduce /var directories (YARN requires different structure)
sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Verify the HDFS File Structure
$ sudo -u hdfs hadoop fs -ls -R /

You should see:
drwxrwxrwt   - hdfs supergroup          0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred   supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

Create a Home Directory for each MapReduce User
Create a home directory for each MapReduce user. It is best to do this on the NameNode; for
$ sudo -u hdfs hadoop fs -mkdir  /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.

p.s. whenever you need to add more Nodes running DataNode/TaskTracker:
1. check your firewall (iptables) if running and what ports are allowed
2. add hostname (by running "$hostname -f") inside your /conf/conf.cluster/slaves on NameNode1
3. start DataNode + TaskTracker on the newly added Node
4. restart DataNode / JobTracker on your NameNode1
5. Check that your DataNode registered by running "hadoop dfsadmin -printTopology".
6. If I am duplicating an instance on EC2 currently running DataNode, before I start above
two Daemons I make sure I delete  data inside /var/log/hadoop-hdfs, /var/log/hadoop-mapreduce
and /tmp/hadoop-hdfs folders. Starting DataNode and TaskTracker Daemon will recreate new files

Happy Hadooping.
NOTICE: This e-mail message and any attachments are confidential, subject to copyright and
may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not
the intended recipient, please delete and contact the sender immediately. Please consider
the environment before printing this e-mail. AVIS : le présent courriel et toute pièce jointe
qui l'accompagne sont confidentiels, protégés par le droit d'auteur et peuvent être couverts
par le secret professionnel. Toute utilisation, copie ou divulgation non autorisée est interdite.
Si vous n'êtes pas le destinataire prévu de ce courriel, supprimez-le et contactez immédiatement
l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent courriel
View raw message