hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: Job scheduling (Re: Unable to run more than one job concurrently)
Date Fri, 19 May 2006 18:29:02 GMT
We're planning to experiment with using existing batch scheduling  
systems for addressing these concerns later in the year.  Condor and  
Torque being the leading contenders.

The thinking is that these systems have huge investments in  
configurable scheduling policies and that it is best to keep hadoop  
simple and leverage these systems to get fine grained / multi-user  
scheduling control.

If this works, the idea is to run a durable HDFS cluster and have the  
batch systems setup task tracker networks for each user on demand.   
This approach is probably more applicable if you have large clusters  
with distinct users / systems sharing them, so this may not address  
your requirements.

In any case this is why my team is not putting a lot of thought into  
this problem in the short term.  That said, I've always anticipated  
that others in the hadoop community might pursue improved  
scheduling.  I just advocate keeping it simple, because when you look  
at condor or torque you will quickly appreciate how unsimple it can  

On May 19, 2006, at 11:01 AM, Paul Sutter wrote:

> A few suggestions to allow for a very simple extension to the current
> scheduling:
> (1) Allow submission times in the future, enabling the creation of
> "background" jobs. My understanding is that job submission times  
> are used to
> prioritize scheduling. All tasks from a job submitted early run to
> completion before those of a job submitted later. If we could  
> submit any
> days-long jobs with a submission time in the future, say the year  
> 2010, and
> any short hours-long jobs with the current time, that short job  
> would be
> able to interrupt the long job. Hack? Yes. Useful? I think so.
> (2) Have a per-job total task count limit. Currently, we establish the
> number of tasks each node runs, and how many map or reduce tasks we  
> have
> total in a given job. But it would be great if we could set a  
> ceiling on the
> number of tasks that run concurrently for a given job. This may  
> help with
> Andrzej's fetcher (since it is bandwidth constrained, maybe fewer  
> concurrent
> jobs would be fine?).
> (3) Don't start the reducers until a certain number of mappers have
> completed (25%? 75%? 90%?). This optimization of starting early  
> will be less
> important when we've solved the map output copy problems.
> Just a few ideas.
> -----Original Message-----
> From: bpendleton@gmail.com [mailto:bpendleton@gmail.com] On Behalf  
> Of Bryan
> A. Pendleton
> Sent: Friday, May 19, 2006 10:44 AM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: Job scheduling (Re: Unable to run more than one job
> concurrently)
> There are some additional risks to running simultaneous jobs. Right  
> now,
> Hadoop does a very bad job dealing with out-of-space conditions. If  
> you run
> two jobs, where the total amount of temporary space (for map outputs)
> between both jobs is greater than the amount of space available on the
> cluster, then they will both fail. If you run them serially, they  
> should
> both succeed.
> In the very least, it's probably wise to take into account more  
> than just
> scheduling priority in any scheduler. (Expected) temporary space  
> demands,
> bandwidth limits, and size of jobs should be some of the criteria  
> available
> to the scheduler.
> On 5/19/06, Andrzej Bialecki <ab@getopt.org> wrote:
>> Andrzej Bialecki wrote:
>>> Hi all,
>>> I'm running Hadoop on a relatively small cluster (5 nodes) with
>>> growing datasets.
>>> I noticed that if I start a job that is configured to run more map
>>> tasks than is the cluster capacity  
>>> (mapred.tasktracker.tasks.maximum *
>>> number of nodes, 20 in this case), of course only that many map  
>>> tasks
>>> will run, and when they are finished the next map tasks from that  
>>> job
>>> will be scheduled.
>>> However, when I try to start another job in parallel, only its  
>>> reduce
>>> tasks will be scheduled (uselessly spin-waiting for map output, and
>>> only reducing the number of available tasks in the cluster...),  
>>> and no
>>> map tasks from this job will be scheduled - until the first job
>>> completes. This feels wrong - not only I'm not making progress on  
>>> the
>>> second job, but I'm also taking the slots away from the first job!
>>> I'm somewhat miffed about this - I'd think that jobtracker should
>>> split the available resources evenly between these two jobs, i.e. it
>>> should schedule some map tasks from the first job and some from the
>>> second one. This is not what is happening, though ...
>>> Is this a configuration error, a bug, or a feature? :)
>> It seems it's a feature - I found the code in
>> JobTracker.pollForNewTask(), and I'm not too happy about it.
>> Let's consider the following example: if I'm running a Nutch fetcher,
>> the main limitation is the available bandwidth to fetch pages, and  
>> not
>> the capacity of the cluster. I'd love to be able to execute other  
>> jobs
>> in parallel, so that I don't have to wait until fetcher completes. I
>> could sacrifice some of the task slots on tasktrackers for that other
>> job, because the fetcher job wouldn't suffer from this anyway (at  
>> least
>> not too much).
>> So, I'd like to change this code to pick up a random job from the  
>> list
>> jobsByArrival, and take job.obtainNewMapTask from that randomly  
>> selected
>> job. Would that work? Additionally, if no map tasks from that job  
>> have
>> been allocated I'd like to skip adding reduce tasks from that job,  
>> later
>> in lines 721-750.
>> Perhaps we should extend JobInProgress to include a priority, and
>> implement something a la Unix scheduler.
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>> ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
> -- 
> Bryan A. Pendleton
> Ph: (877) geek-1-bp

View raw message