Subject: Re: Problems with HOD and HDFS
From: Jason Stowe
Reply-To: jstowe@cyclecomputing.com
To: common-user@hadoop.apache.org
Date: Tue, 15 Jun 2010 15:10:16 -0400

Hi David,

The original HOD project was integrated with Condor (http://bit.ly/CondorProject), which Yahoo! was using to schedule clusters.

A year or two ago the Condor project, which is open source and free of licensing costs, added close integration with Hadoop (as SGE has done as well). I presented on this at a prior Hadoop World, and the Condor team presented at Condor Week 2010:
http://bit.ly/Condor_Hadoop_CondorWeek2010

My company has solutions for deploying Hadoop clusters on shared infrastructure using CycleServer and schedulers like Condor/SGE/etc.
The general deployment strategy is to deploy the head nodes (NameNode/JobTracker) first, then the execute nodes, and to be careful about how you handle data placement, sizing, and replication counts.

If you're interested in this, please feel free to drop us a line at my e-mail or http://cyclecomputing.com/about/contact

Thanks,
Jason

On Mon, Jun 14, 2010 at 7:45 PM, David Milne wrote:
> Unless I am missing something, the Fair Share and Capacity schedulers sound like a solution to a different problem: aren't they for a dedicated Hadoop cluster that needs to be shared by lots of people? I have a general-purpose cluster that needs to be shared by lots of people. Only one of them (me) wants to run Hadoop, and only wants to run it intermittently. I'm not concerned with data locality, as my workflow is:
>
> 1) upload the data I need to process to the cluster
> 2) run a chain of map-reduce tasks
> 3) grab the processed data from the cluster
> 4) clean up the cluster
>
> Mesos sounds good, but I am definitely NOT brave about this. As I said, I am just one user of the cluster among many. I would want to stick with Torque and Maui for resource management.
>
> - Dave
>
> On Tue, Jun 15, 2010 at 12:37 AM, Amr Awadallah wrote:
> > Dave,
> >
> > Yes, many others have the same situation; the recommended solution is to use either the Fair Share Scheduler or the Capacity Scheduler. These schedulers are much better than HOD since they take data locality into consideration (they don't just spin up 20 TaskTracker nodes on machines that have nothing to do with your data). They also don't lock down the nodes just for you, so as TaskTrackers are freed other jobs can use them immediately (as opposed to nobody being able to use them until your entire job is done).
> >
> > Also, if you are brave and want to try something spanking new, I recommend you reach out to the Mesos guys; they have a scheduler layer under Hadoop that is data-locality aware:
> >
> > http://mesos.berkeley.edu/
> >
> > -- amr
> >
> > On Sun, Jun 13, 2010 at 9:21 PM, David Milne wrote:
> >> Ok, thanks Jeff.
> >>
> >> This is pretty surprising though. I would have thought many people would be in my position, where they have to use Hadoop on a general-purpose cluster and need it to play nicely with a resource manager. What do other people do in this position, if they don't use HOD? Deprecated normally means there is a better alternative.
> >>
> >> - Dave
> >>
> >> On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher wrote:
> >> > Hey Dave,
> >> >
> >> > I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't think HOD is actively used or developed anywhere these days. You're attempting to use a mostly deprecated project, and hence not receiving any support on the mailing list.
> >> >
> >> > Thanks,
> >> > Jeff
> >> >
> >> > On Sun, Jun 13, 2010 at 7:33 PM, David Milne wrote:
> >> >> Anybody? I am completely stuck here. I have no idea who else I can ask or where I can go for more information. Is there somewhere specific where I should be asking about HOD?
> >> >>
> >> >> Thank you,
> >> >> Dave
> >> >>
> >> >> On Thu, Jun 10, 2010 at 2:56 PM, David Milne wrote:
> >> >> > Hi there,
> >> >> >
> >> >> > I am trying to get Hadoop on Demand up and running, but am having problems with the ringmaster not being able to communicate with HDFS.
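> >> >> >
> >> >> > (For reference, the cluster is being allocated with something close to the command below -- the node count here is just a placeholder rather than my exact value:)
> >> >> >
> >> >> >   hod allocate -d /home/dmilne/hadoop/cluster -n 4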
> >> >> >
> >> >> > The output from the hod allocate command ends with this, with full verbosity:
> >> >> >
> >> >> > [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve 'hdfs' service address.
> >> >> > [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
> >> >> > [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
> >> >> > [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
> >> >> > [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate cluster /home/dmilne/hadoop/cluster
> >> >> > [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
> >> >> >
> >> >> > I've attached the hodrc file below, but briefly: HOD is supposed to provision an HDFS cluster as well as a Map/Reduce cluster, and it seems to be failing to do so. The ringmaster log looks like this:
> >> >> >
> >> >> > [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> >> >> > [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr service:
> >> >> > [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
> >> >> > [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> >> >> > [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr service:
> >> >> > [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
> >> >> >
> >> >> > ... and so on, until it gives up.
> >> >> >
> >> >> > Any ideas why? One red flag is that when running the allocate command, some of the variables echoed back look dodgy:
> >> >> >
> >> >> > --gridservice-hdfs.fs_port 0
> >> >> > --gridservice-hdfs.host localhost
> >> >> > --gridservice-hdfs.info_port 0
> >> >> >
> >> >> > These are not what I specified in the hodrc. Are the port numbers just set to 0 because I am not using an external HDFS, or is this a problem?
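> >> >> >
> >> >> > (For comparison, my understanding is that a hodrc pointing at an externally managed HDFS would use a section roughly like the sketch below -- the namenode host name is only a placeholder -- instead of the external = False setup shown further down:)
> >> >> >
> >> >> > [gridservice-hdfs]
> >> >> > external = True
> >> >> > host = namenode.example.org
> >> >> > fs_port = 8020
> >> >> > info_port = 50070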
> >> >> >
> >> >> > The software versions involved are:
> >> >> > - Hadoop 0.20.2
> >> >> > - Python 2.5.2 (no Twisted)
> >> >> > - Java 1.6.0_20
> >> >> > - Torque 2.4.5
> >> >> >
> >> >> > The hodrc file looks like this:
> >> >> >
> >> >> > [hod]
> >> >> > stream = True
> >> >> > java-home = /opt/jdk1.6.0_20
> >> >> > cluster = debian5
> >> >> > cluster-factor = 1.8
> >> >> > xrs-port-range = 32768-65536
> >> >> > debug = 3
> >> >> > allocate-wait-time = 3600
> >> >> > temp-dir = /scratch/local/dmilne/hod
> >> >> >
> >> >> > [ringmaster]
> >> >> > register = True
> >> >> > stream = False
> >> >> > temp-dir = /scratch/local/dmilne/hod
> >> >> > log-dir = /scratch/local/dmilne/hod/log
> >> >> > http-port-range = 8000-9000
> >> >> > idleness-limit = 864000
> >> >> > work-dirs = /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
> >> >> > xrs-port-range = 32768-65536
> >> >> > debug = 4
> >> >> >
> >> >> > [hodring]
> >> >> > stream = False
> >> >> > temp-dir = /scratch/local/dmilne/hod
> >> >> > log-dir = /scratch/local/dmilne/hod/log
> >> >> > register = True
> >> >> > java-home = /opt/jdk1.6.0_20
> >> >> > http-port-range = 8000-9000
> >> >> > xrs-port-range = 32768-65536
> >> >> > debug = 4
> >> >> >
> >> >> > [resource_manager]
> >> >> > queue = express
> >> >> > batch-home = /opt/torque-2.4.5
> >> >> > id = torque
> >> >> > options = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
> >> >> > #env-vars = HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
> >> >> >
> >> >> > [gridservice-mapred]
> >> >> > external = False
> >> >> > pkgs = /opt/hadoop-0.20.2
> >> >> > tracker_port = 8030
> >> >> > info_port = 50080
> >> >> >
> >> >> > [gridservice-hdfs]
> >> >> > external = False
> >> >> > pkgs = /opt/hadoop-0.20.2
> >> >> > fs_port = 8020
> >> >> > info_port = 50070
> >> >> >
> >> >> > Cheers,
> >> >> > Dave

--
==================================
Jason A. Stowe
cell: 607.227.9686
main: 888.292.5320

http://twitter.com/jasonastowe/
http://twitter.com/cyclecomputing/

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com