Message-ID: <446E158B.9060203@apache.org>
Date: Fri, 19 May 2006 11:59:23 -0700
From: Doug Cutting
To: hadoop-dev@lucene.apache.org
Subject: Re: Job scheduling (Re: Unable to run more than one job concurrently)
In-Reply-To: <20060519180104.D58BC10FB004@asf.osuosl.org>

Paul Sutter wrote:
> (1) Allow submission times in the future, enabling the creation of
> "background" jobs. My understanding is that job submission times are used
> to prioritize scheduling. All tasks from a job submitted early run to
> completion before those of a job submitted later. If we could submit
> days-long jobs with a submission time in the future, say the year 2010,
> and short hours-long jobs with the current time, the short jobs would be
> able to interrupt the long ones. Hack? Yes. Useful? I think so.

I think this is equivalent to adding a job priority, where tasks from the
highest-priority job are run first. If jobs are at the same priority, then
the first submitted would run. Adding priority would add a bit more
complexity, but would also be less of a hack.
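A minimal sketch of that ordering (not actual Hadoop code; QueuedJob and its
fields are hypothetical names) could compare an explicit priority first and
fall back to submission time:

// Hypothetical sketch, not actual Hadoop code: order jobs by an explicit
// priority, falling back to submission time so that equal-priority jobs
// still run first-come, first-served.
class QueuedJob implements Comparable<QueuedJob> {
    final String id;          // job identifier
    final int priority;       // higher value = more urgent
    final long submitTime;    // submission time in millis; earlier wins ties

    QueuedJob(String id, int priority, long submitTime) {
        this.id = id;
        this.priority = priority;
        this.submitTime = submitTime;
    }

    // Highest priority first; among equal priorities, earliest submission first.
    public int compareTo(QueuedJob other) {
        if (priority != other.priority) {
            return (other.priority > priority) ? 1 : -1;   // descending priority
        }
        if (submitTime != other.submitTime) {
            return (submitTime < other.submitTime) ? -1 : 1;   // ascending time
        }
        return 0;
    }
}

Sorting the pending-job list with a comparison like this keeps first-come,
first-served behavior within each priority level, which matches what happens
today when every job effectively has the same priority.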
> (2) Have a per-job total task count limit. Currently, we establish the
> number of tasks each node runs, and how many map or reduce tasks we have
> total in a given job. But it would be great if we could set a ceiling on
> the number of tasks that run concurrently for a given job. This may help
> with Andrzej's fetcher (since it is bandwidth constrained, maybe fewer
> concurrent tasks would be fine?).

I like this idea. So if the highest-priority job is already running at its
task limit, then tasks can be run from the next highest-priority job (a
sketch of this selection rule follows at the end of this message). Should
there be separate limits for maps and reduces?

> (3) Don't start the reducers until a certain number of mappers have
> completed (25%? 75%? 90%?). This optimization of starting early will be
> less important when we've solved the map output copy problems.

Would this only be done if there were other jobs competing for reduce slots?
Otherwise this could hurt performance.

In theory, network bandwidth is the primary MapReduce bottleneck. Map input
usually doesn't cross the network, but map output must generally cross the
network to reduce nodes. Sorting can be done in parallel with mapping and
with copying of map output to reduce nodes, but reducing cannot begin until
all map output has arrived. Finally, in general, reduce output must cross
the network again, for replication. So the data must cross the network
twice, once between map and reduce and once on reduce output, and these two
transfers are serialized. Since the network is the bottleneck, we get the
best performance when we saturate it continuously, so we should begin
transferring map outputs as soon as they are available.

That's the theory.

Doug
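A rough sketch of the per-job task limit and the delayed-reduce check
discussed above (hypothetical names throughout, not actual JobTracker code;
the job list is assumed to already be sorted highest-priority-first, as in
the earlier QueuedJob sketch):

// Hypothetical sketch, not actual JobTracker code. SchedulableJob and all of
// its fields are made-up names for illustration.
import java.util.List;

class SchedulableJob {
    int runningTasks;     // tasks currently executing for this job
    int pendingTasks;     // tasks still waiting to be launched
    int maxRunningTasks;  // per-job concurrency cap; <= 0 means "no limit"
    int completedMaps;    // map tasks that have finished
    int totalMaps;        // map tasks in the job

    boolean atTaskLimit() {
        return maxRunningTasks > 0 && runningTasks >= maxRunningTasks;
    }

    // Point (3): only launch reduces once some fraction of the maps is done.
    boolean readyForReduces(double minMapFractionDone) {
        return totalMaps == 0
            || (double) completedMaps / totalMaps >= minMapFractionDone;
    }
}

class SlotAssigner {
    // Hand one free task slot to the first job, in priority order, that still
    // has pending work and is not already running at its task limit; a capped
    // job is simply skipped so the next job in line can use the slot.
    static SchedulableJob pickJobForFreeSlot(List<SchedulableJob> jobs) {
        for (SchedulableJob job : jobs) {
            if (job.pendingTasks > 0 && !job.atTaskLimit()) {
                return job;
            }
        }
        return null;  // no runnable work anywhere
    }
}

With a rule like this, a bandwidth-constrained job such as the fetcher could
set a low cap and leave its unused slots to other jobs, while a reduce-start
threshold would only need to be applied when reduce slots are contended.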