hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Problems with HOD and HDFS
Date Tue, 15 Jun 2010 20:47:52 GMT
On Tue, Jun 15, 2010 at 3:10 PM, Jason Stowe <jstowe@cyclecomputing.com> wrote:

> Hi David,
> The original HOD project was integrated with Condor (
> http://bit.ly/CondorProject), which Yahoo! was using to schedule clusters.
>
> A year or two ago, the Condor project, which is open source with no
> licensing costs, created close integration with Hadoop (as has SGE), as
> presented by me at a prior Hadoop World and by the Condor team at Condor
> Week 2010:
> http://bit.ly/Condor_Hadoop_CondorWeek2010
>
> My company has solutions for deploying Hadoop clusters on shared
> infrastructure using CycleServer and schedulers like Condor/SGE/etc. The
> general deployment strategy is to deploy the head nodes (NameNode/JobTracker),
> then the execute nodes, and to be careful about how you deal with
> data/sizing/replication counts.
>
> If you're interested in this, please feel free to drop us a line at my
> e-mail or http://cyclecomputing.com/about/contact
>
> Thanks,
> Jason
>
>
> On Mon, Jun 14, 2010 at 7:45 PM, David Milne <d.n.milne@gmail.com> wrote:
>
> > Unless I am missing something, the Fair Share and Capacity schedulers
> > sound like a solution to a different problem: aren't they for a
> > dedicated Hadoop cluster that needs to be shared by lots of people? I
> > have a general-purpose cluster that needs to be shared by lots of
> > people. Only one of them (me) wants to run Hadoop, and only wants to
> > run it intermittently. I'm not concerned with data locality, as my
> > workflow is:
> >
> > 1) upload data I need to process to cluster
> > 2) run a chain of map-reduce tasks
> > 3) grab processed data from cluster
> > 4) clean up cluster
> >
> > Mesos sounds good, but I am definitely NOT brave about this. As I
> > said, I am just one user of the cluster among many. I would want to
> > stick with Torque and Maui for resource management.
> >
> > - Dave
> >
> > On Tue, Jun 15, 2010 at 12:37 AM, Amr Awadallah <aaa@cloudera.com> wrote:
> > > Dave,
> > >
> > >  Yes, many others have the same situation; the recommended solution is
> > > either to use the Fair Share Scheduler or the Capacity Scheduler. These
> > > schedulers are much better than HOD since they take data locality into
> > > consideration (they don't just spin up 20 TT nodes on machines that have
> > > nothing to do with your data). They also don't lock down the nodes just
> > > for you, so as TTs are freed other jobs can use them immediately (as
> > > opposed to nobody being able to use them till your entire job is done).
> > >
> > >  Also, if you are brave and want to try something spanking new, then I
> > > recommend you reach out to the Mesos guys; they have a scheduler layer
> > > under Hadoop that is data-locality aware:
> > >
> > > http://mesos.berkeley.edu/
> > >
> > > -- amr
> > >
> > On Sun, Jun 13, 2010 at 9:21 PM, David Milne <d.n.milne@gmail.com> wrote:
> > >
> > >> Ok, thanks Jeff.
> > >>
> > >> This is pretty surprising though. I would have thought many people
> > >> would be in my position, where they have to use Hadoop on a general
> > >> purpose cluster, and need it to play nice with a resource manager?
> > >> What do other people do in this position, if they don't use HOD?
> > >> Deprecated normally means there is a better alternative.
> > >>
> > >> - Dave
> > >>
> > >> On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher <hammer@cloudera.com> wrote:
> > >> > Hey Dave,
> > >> >
> > >> > I can't speak for the folks at Yahoo!, but from watching the JIRA, I
> > >> > don't think HOD is actively used or developed anywhere these days.
> > >> > You're attempting to use a mostly deprecated project, and hence not
> > >> > receiving any support on the mailing list.
> > >> >
> > >> > Thanks,
> > >> > Jeff
> > >> >
> > >> > On Sun, Jun 13, 2010 at 7:33 PM, David Milne <d.n.milne@gmail.com> wrote:
> > >> >
> > >> >> Anybody? I am completely stuck here. I have no idea who else I can
> > >> >> ask or where I can go for more information. Is there somewhere
> > >> >> specific where I should be asking about HOD?
> > >> >>
> > >> >> Thank you,
> > >> >> Dave
> > >> >>
> > >> >> On Thu, Jun 10, 2010 at 2:56 PM, David Milne <d.n.milne@gmail.com> wrote:
> > >> >> > Hi there,
> > >> >> >
> > >> >> > I am trying to get Hadoop on Demand up and running, but am having
> > >> >> > problems with the ringmaster not being able to communicate with
> > >> >> > HDFS.
> > >> >> >
> > >> >> > The output from the hod allocate command ends with this, with full
> > >> >> > verbosity:
> > >> >> >
> > >> >> > [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve 'hdfs' service address.
> > >> >> > [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
> > >> >> > [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
> > >> >> > [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
> > >> >> > [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate cluster /home/dmilne/hadoop/cluster
> > >> >> > [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
> > >> >> >
> > >> >> >
> > >> >> > I've attached the hodrc file below, but briefly HOD is supposed to
> > >> >> > provision an HDFS cluster as well as a Map/Reduce cluster, and seems
> > >> >> > to be failing to do so. The ringmaster log looks like this:
> > >> >> >
> > >> >> > [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> > >> >> > [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
> > >> >> > [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
> > >> >> > [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> > >> >> > [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
> > >> >> > [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
> > >> >> >
> > >> >> > ... and so on, until it gives up
> > >> >> >
> > >> >> > Any ideas why? One red flag is that when running the allocate
> > >> >> > command, some of the variables echoed back look dodgy:
> > >> >> >
> > >> >> > --gridservice-hdfs.fs_port 0
> > >> >> > --gridservice-hdfs.host localhost
> > >> >> > --gridservice-hdfs.info_port 0
> > >> >> >
> > >> >> > These are not what I specified in the hodrc. Are the port numbers
> > >> >> > just set to 0 because I am not using an external HDFS, or is this a
> > >> >> > problem?
> > >> >> >
> > >> >> >
> > >> >> > The software versions involved are:
> > >> >> >  - Hadoop 0.20.2
> > >> >> >  - Python 2.5.2 (no Twisted)
> > >> >> >  - Java 1.6.0_20
> > >> >> >  - Torque 2.4.5
> > >> >> >
> > >> >> >
> > >> >> > The hodrc file looks like this:
> > >> >> >
> > >> >> > [hod]
> > >> >> > stream                          = True
> > >> >> > java-home                       = /opt/jdk1.6.0_20
> > >> >> > cluster                         = debian5
> > >> >> > cluster-factor                  = 1.8
> > >> >> > xrs-port-range                  = 32768-65536
> > >> >> > debug                           = 3
> > >> >> > allocate-wait-time              = 3600
> > >> >> > temp-dir                        = /scratch/local/dmilne/hod
> > >> >> >
> > >> >> > [ringmaster]
> > >> >> > register                        = True
> > >> >> > stream                          = False
> > >> >> > temp-dir                        = /scratch/local/dmilne/hod
> > >> >> > log-dir                         = /scratch/local/dmilne/hod/log
> > >> >> > http-port-range                 = 8000-9000
> > >> >> > idleness-limit                  = 864000
> > >> >> > work-dirs                       = /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
> > >> >> > xrs-port-range                  = 32768-65536
> > >> >> > debug                           = 4
> > >> >> >
> > >> >> > [hodring]
> > >> >> > stream                          = False
> > >> >> > temp-dir                        = /scratch/local/dmilne/hod
> > >> >> > log-dir                         = /scratch/local/dmilne/hod/log
> > >> >> > register                        = True
> > >> >> > java-home                       = /opt/jdk1.6.0_20
> > >> >> > http-port-range                 = 8000-9000
> > >> >> > xrs-port-range                  = 32768-65536
> > >> >> > debug                           = 4
> > >> >> >
> > >> >> > [resource_manager]
> > >> >> > queue                           = express
> > >> >> > batch-home                      = /opt/torque-2.4.5
> > >> >> > id                              = torque
> > >> >> > options                         = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
> > >> >> > #env-vars                       = HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
> > >> >> >
> > >> >> > [gridservice-mapred]
> > >> >> > external                        = False
> > >> >> > pkgs                            = /opt/hadoop-0.20.2
> > >> >> > tracker_port                    = 8030
> > >> >> > info_port                       = 50080
> > >> >> >
> > >> >> > [gridservice-hdfs]
> > >> >> > external                        = False
> > >> >> > pkgs                            = /opt/hadoop-0.20.2
> > >> >> > fs_port                         = 8020
> > >> >> > info_port                       = 50070
> > >> >> >
> > >> >> > Cheers,
> > >> >> > Dave
> > >> >> >
> > >> >>
> > >> >
> > >>
> > >
> >
>
>
>
> --
>
> ==================================
> Jason A. Stowe
> cell: 607.227.9686
> main: 888.292.5320
>
> http://twitter.com/jasonastowe/
> http://twitter.com/cyclecomputing/
>
> Cycle Computing, LLC
> Leader in Open Compute Solutions for Clouds, Servers, and Desktops
> Enterprise Condor Support and Management Tools
>
> http://www.cyclecomputing.com
> http://www.cyclecloud.com
>

>> but I don't follow how this stretches out to using multiple machines
>> allocated by Torque.

Hadoop does not have a concept of virtual hosting. The NameNode has a port,
the JobTracker has a port, the DataNode uses a port and has another port for
its web interface, and the TaskTracker is the same deal. Running multiple
copies of Hadoop on the same machine is "easy": all you have to do is make
sure they do not step on each other. Make sure they do not write to the same
folder locations, and make sure they do not use the same ports.

Single setup
NameNode: 9000 Web: 50070
JobTracker: 1000 Web: 50030
...

Multi setup

Setup 1
NameNode: 9001 Web: 50071
JobTracker: 1001 Web: 50031
...

Setup 2
NameNode: 9002 Web: 50072
JobTracker: 1002 Web: 50032
...
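
To make that concrete, here is a rough sketch (mine, not anything HOD emits) of wiring up the two setups above by hand on Hadoop 0.20. The conf directory names, the host name "node01", and the /scratch paths are made up for illustration; the property names (fs.default.name, mapred.job.tracker, dfs.http.address, mapred.job.tracker.http.address, hadoop.tmp.dir) are the standard 0.20 knobs for those ports.

# Sketch only: two independent Hadoop 0.20.2 instances on one machine.
#
# conf.setup1/core-site.xml:   fs.default.name    = hdfs://node01:9001
#                              hadoop.tmp.dir     = /scratch/setup1
# conf.setup1/hdfs-site.xml:   dfs.http.address   = node01:50071
# conf.setup1/mapred-site.xml: mapred.job.tracker = node01:1001
#                              mapred.job.tracker.http.address = node01:50031
# conf.setup2/* is identical except 9002 / 50072 / 1002 / 50032 and
# /scratch/setup2. Give each setup its own HADOOP_LOG_DIR and HADOOP_PID_DIR
# in hadoop-env.sh, and apply the same per-setup treatment to the DataNode
# and TaskTracker ports, so nothing collides.

HADOOP_HOME=/opt/hadoop-0.20.2

for c in conf.setup1 conf.setup2; do
  $HADOOP_HOME/bin/hadoop --config $c namenode -format
  $HADOOP_HOME/bin/hadoop-daemon.sh --config $c start namenode
  $HADOOP_HOME/bin/hadoop-daemon.sh --config $c start jobtracker
done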

HOD is supposed to handle the "dirty" work for you: building configuration
files, installing Hadoop on the nodes, and starting the Hadoop components.
You could theoretically accomplish similar things with remote SSH keys and a
boatload of scripting. HOD is a deployment and management tool.
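
For a sense of scale, the "SSH keys plus scripting" route boils down to loops like the following untested sketch (the host names and paths are placeholders):

MASTER=node01
SLAVES="node02 node03 node04"
HADOOP_HOME=/opt/hadoop-0.20.2
CONF=/scratch/myjob/conf    # per-allocation config, as HOD would generate

# push the generated config to every node, then start the daemons
for h in $MASTER $SLAVES; do scp -rq $CONF $h:$CONF; done
ssh $MASTER "$HADOOP_HOME/bin/hadoop-daemon.sh --config $CONF start namenode"
ssh $MASTER "$HADOOP_HOME/bin/hadoop-daemon.sh --config $CONF start jobtracker"
for h in $SLAVES; do
  ssh $h "$HADOOP_HOME/bin/hadoop-daemon.sh --config $CONF start datanode"
  ssh $h "$HADOOP_HOME/bin/hadoop-daemon.sh --config $CONF start tasktracker"
done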

It sounds like it may not meet your need. Is your goal to deploy and manage
just one instance of Hadoop, or multiple instances? HOD is designed to
install multiple instances of Hadoop on a single set of hardware. It sounds
like you want to deploy one cluster per group of VMs, which is not really
the same thing.
