Subject: Re: Problems with HOD and HDFS
From: Jason Stowe
Reply-To: jstowe@cyclecomputing.com
To: common-user@hadoop.apache.org
Date: Tue, 15 Jun 2010 15:10:16 -0400

Hi David,

The original HOD project was integrated with Condor (http://bit.ly/CondorProject), which Yahoo! was using to schedule clusters.

A year or two ago the Condor project, which is open source and free of licensing costs, added close integration with Hadoop (as SGE has done as well). I presented on this at a prior Hadoop World, and the Condor team presented at Condor Week 2010:
http://bit.ly/Condor_Hadoop_CondorWeek2010

My company has solutions for deploying Hadoop clusters on shared infrastructure using CycleServer and schedulers like Condor/SGE/etc.
The general deployment strategy is to deploy the head nodes (NameNode/JobTracker) first, then the execute nodes, and to be careful about how you handle data placement, sizing, and replication counts.

If you're interested in this, please feel free to drop us a line at my e-mail or http://cyclecomputing.com/about/contact

Thanks,
Jason

On Mon, Jun 14, 2010 at 7:45 PM, David Milne wrote:
> Unless I am missing something, the Fair Share and Capacity schedulers sound like a solution to a different problem: aren't they for a dedicated Hadoop cluster that needs to be shared by lots of people? I have a general-purpose cluster that needs to be shared by lots of people. Only one of them (me) wants to run Hadoop, and only wants to run it intermittently. I'm not concerned with data locality, as my workflow is:
>
> 1) upload the data I need to process to the cluster
> 2) run a chain of map-reduce tasks
> 3) grab the processed data from the cluster
> 4) clean up the cluster
>
> Mesos sounds good, but I am definitely NOT brave about this. As I said, I am just one user of the cluster among many. I would want to stick with Torque and Maui for resource management.
>
> - Dave
>
> On Tue, Jun 15, 2010 at 12:37 AM, Amr Awadallah wrote:
> > Dave,
> >
> > Yes, many others have the same situation; the recommended solution is to use either the Fair Share Scheduler or the Capacity Scheduler. These schedulers are much better than HOD since they take data locality into consideration (they don't just spin up 20 TaskTracker nodes on machines that have nothing to do with your data). They also don't lock down the nodes just for you, so as TaskTrackers are freed other jobs can use them immediately (as opposed to nobody being able to use them until your entire job is done).
> >
> > Also, if you are brave and want to try something spanking new, I recommend you reach out to the Mesos guys; they have a scheduler layer under Hadoop that is data-locality aware:
> >
> > http://mesos.berkeley.edu/
> >
> > -- amr
> >
> > On Sun, Jun 13, 2010 at 9:21 PM, David Milne wrote:
> >> Ok, thanks Jeff.
> >>
> >> This is pretty surprising though. I would have thought many people would be in my position, where they have to use Hadoop on a general-purpose cluster and need it to play nicely with a resource manager. What do other people do in this position, if they don't use HOD? Deprecated normally means there is a better alternative.
> >>
> >> - Dave
> >>
> >> On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher wrote:
> >> > Hey Dave,
> >> >
> >> > I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't think HOD is actively used or developed anywhere these days. You're attempting to use a mostly deprecated project, and hence not receiving any support on the mailing list.
> >> >
> >> > Thanks,
> >> > Jeff
> >> >
> >> > On Sun, Jun 13, 2010 at 7:33 PM, David Milne wrote:
> >> >> Anybody? I am completely stuck here. I have no idea who else I can ask or where I can go for more information. Is there somewhere specific where I should be asking about HOD?
> >> >>
> >> >> Thank you,
> >> >> Dave
> >> >>
> >> >> On Thu, Jun 10, 2010 at 2:56 PM, David Milne wrote:
> >> >> > Hi there,
> >> >> >
> >> >> > I am trying to get Hadoop on Demand up and running, but am having problems with the ringmaster not being able to communicate with HDFS.
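> >> >> >
> >> >> > (For reference, the cluster is being allocated with something close to the command below -- the node count here is just a placeholder rather than my exact value:)
> >> >> >
> >> >> >   hod allocate -d /home/dmilne/hadoop/cluster -n 4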
> >> >> >
> >> >> > The output from the hod allocate command ends with this, with full verbosity:
> >> >> >
> >> >> > [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve 'hdfs' service address.
> >> >> > [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
> >> >> > [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
> >> >> > [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
> >> >> > [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate cluster /home/dmilne/hadoop/cluster
> >> >> > [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
> >> >> >
> >> >> > I've attached the hodrc file below, but briefly: HOD is supposed to provision an HDFS cluster as well as a Map/Reduce cluster, and it seems to be failing to do so. The ringmaster log looks like this:
> >> >> >
> >> >> > [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> >> >> > [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr service:
> >> >> > [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
> >> >> > [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> >> >> > [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr service:
> >> >> > [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
> >> >> >
> >> >> > ... and so on, until it gives up.
> >> >> >
> >> >> > Any ideas why? One red flag is that when running the allocate command, some of the variables echoed back look dodgy:
> >> >> >
> >> >> > --gridservice-hdfs.fs_port 0
> >> >> > --gridservice-hdfs.host localhost
> >> >> > --gridservice-hdfs.info_port 0
> >> >> >
> >> >> > These are not what I specified in the hodrc. Are the port numbers just set to 0 because I am not using an external HDFS, or is this a problem?
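> >> >> >
> >> >> > (For comparison, my understanding is that a hodrc pointing at an externally managed HDFS would use a section roughly like the sketch below -- the namenode host name is only a placeholder -- instead of the external = False setup shown further down:)
> >> >> >
> >> >> > [gridservice-hdfs]
> >> >> > external = True
> >> >> > host = namenode.example.org
> >> >> > fs_port = 8020
> >> >> > info_port = 50070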
> >> >> >
> >> >> > The software versions involved are:
> >> >> > - Hadoop 0.20.2
> >> >> > - Python 2.5.2 (no Twisted)
> >> >> > - Java 1.6.0_20
> >> >> > - Torque 2.4.5
> >> >> >
> >> >> > The hodrc file looks like this:
> >> >> >
> >> >> > [hod]
> >> >> > stream = True
> >> >> > java-home = /opt/jdk1.6.0_20
> >> >> > cluster = debian5
> >> >> > cluster-factor = 1.8
> >> >> > xrs-port-range = 32768-65536
> >> >> > debug = 3
> >> >> > allocate-wait-time = 3600
> >> >> > temp-dir = /scratch/local/dmilne/hod
> >> >> >
> >> >> > [ringmaster]
> >> >> > register = True
> >> >> > stream = False
> >> >> > temp-dir = /scratch/local/dmilne/hod
> >> >> > log-dir = /scratch/local/dmilne/hod/log
> >> >> > http-port-range = 8000-9000
> >> >> > idleness-limit = 864000
> >> >> > work-dirs = /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
> >> >> > xrs-port-range = 32768-65536
> >> >> > debug = 4
> >> >> >
> >> >> > [hodring]
> >> >> > stream = False
> >> >> > temp-dir = /scratch/local/dmilne/hod
> >> >> > log-dir = /scratch/local/dmilne/hod/log
> >> >> > register = True
> >> >> > java-home = /opt/jdk1.6.0_20
> >> >> > http-port-range = 8000-9000
> >> >> > xrs-port-range = 32768-65536
> >> >> > debug = 4
> >> >> >
> >> >> > [resource_manager]
> >> >> > queue = express
> >> >> > batch-home = /opt/torque-2.4.5
> >> >> > id = torque
> >> >> > options = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
> >> >> > #env-vars = HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
> >> >> >
> >> >> > [gridservice-mapred]
> >> >> > external = False
> >> >> > pkgs = /opt/hadoop-0.20.2
> >> >> > tracker_port = 8030
> >> >> > info_port = 50080
> >> >> >
> >> >> > [gridservice-hdfs]
> >> >> > external = False
> >> >> > pkgs = /opt/hadoop-0.20.2
> >> >> > fs_port = 8020
> >> >> > info_port = 50070
> >> >> >
> >> >> > Cheers,
> >> >> > Dave

--
==================================
Jason A. Stowe
cell: 607.227.9686
main: 888.292.5320

http://twitter.com/jasonastowe/
http://twitter.com/cyclecomputing/

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com