Message-ID: <446E158B.9060203@apache.org>
Date: Fri, 19 May 2006 11:59:23 -0700
From: Doug Cutting
To: hadoop-dev@lucene.apache.org
Subject: Re: Job scheduling (Re: Unable to run more than one job concurrently)
In-Reply-To: <20060519180104.D58BC10FB004@asf.osuosl.org>

Paul Sutter wrote:
> (1) Allow submission times in the future, enabling the creation of
> "background" jobs. My understanding is that job submission times are used
> to prioritize scheduling. All tasks from a job submitted early run to
> completion before those of a job submitted later. If we could submit
> days-long jobs with a submission time in the future, say the year 2010,
> and short hours-long jobs with the current time, the short jobs would be
> able to interrupt the long ones. Hack? Yes. Useful? I think so.

I think this is equivalent to adding a job priority, where tasks from the
highest-priority job are run first. If jobs are at the same priority, then
the first submitted would run. Adding priority would add a bit more
complexity, but would also be less of a hack.
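A minimal sketch of that ordering (not actual Hadoop code; QueuedJob and its
fields are hypothetical names) could compare an explicit priority first and
fall back to submission time:

// Hypothetical sketch, not actual Hadoop code: order jobs by an explicit
// priority, falling back to submission time so that equal-priority jobs
// still run first-come, first-served.
class QueuedJob implements Comparable<QueuedJob> {
    final String id;          // job identifier
    final int priority;       // higher value = more urgent
    final long submitTime;    // submission time in millis; earlier wins ties

    QueuedJob(String id, int priority, long submitTime) {
        this.id = id;
        this.priority = priority;
        this.submitTime = submitTime;
    }

    // Highest priority first; among equal priorities, earliest submission first.
    public int compareTo(QueuedJob other) {
        if (priority != other.priority) {
            return (other.priority > priority) ? 1 : -1;   // descending priority
        }
        if (submitTime != other.submitTime) {
            return (submitTime < other.submitTime) ? -1 : 1;   // ascending time
        }
        return 0;
    }
}

Sorting the pending-job list with a comparison like this keeps first-come,
first-served behavior within each priority level, which matches what happens
today when every job effectively has the same priority.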
> (2) Have a per-job total task count limit. Currently, we establish the
> number of tasks each node runs, and how many map or reduce tasks we have
> total in a given job. But it would be great if we could set a ceiling on
> the number of tasks that run concurrently for a given job. This may help
> with Andrzej's fetcher (since it is bandwidth constrained, maybe fewer
> concurrent tasks would be fine?).

I like this idea. So if the highest-priority job is already running at its
task limit, then tasks can be run from the next highest-priority job (a
sketch of this selection rule follows at the end of this message). Should
there be separate limits for maps and reduces?

> (3) Don't start the reducers until a certain number of mappers have
> completed (25%? 75%? 90%?). This optimization of starting early will be
> less important when we've solved the map output copy problems.

Would this only be done if there were other jobs competing for reduce slots?
Otherwise this could hurt performance.

In theory, network bandwidth is the primary MapReduce bottleneck. Map input
usually doesn't cross the network, but map output must generally cross the
network to reduce nodes. Sorting can be done in parallel with mapping and
with copying of map output to reduce nodes, but reducing cannot begin until
all map output has arrived. Finally, in general, reduce output must cross
the network again, for replication. So the data must cross the network
twice, once between map and reduce and once on reduce output, and these two
transfers are serialized. Since the network is the bottleneck, we get the
best performance when we saturate it continuously, so we should begin
transferring map outputs as soon as they are available.

That's the theory.

Doug
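A rough sketch of the per-job task limit and the delayed-reduce check
discussed above (hypothetical names throughout, not actual JobTracker code;
the job list is assumed to already be sorted highest-priority-first, as in
the earlier QueuedJob sketch):

// Hypothetical sketch, not actual JobTracker code. SchedulableJob and all of
// its fields are made-up names for illustration.
import java.util.List;

class SchedulableJob {
    int runningTasks;     // tasks currently executing for this job
    int pendingTasks;     // tasks still waiting to be launched
    int maxRunningTasks;  // per-job concurrency cap; <= 0 means "no limit"
    int completedMaps;    // map tasks that have finished
    int totalMaps;        // map tasks in the job

    boolean atTaskLimit() {
        return maxRunningTasks > 0 && runningTasks >= maxRunningTasks;
    }

    // Point (3): only launch reduces once some fraction of the maps is done.
    boolean readyForReduces(double minMapFractionDone) {
        return totalMaps == 0
            || (double) completedMaps / totalMaps >= minMapFractionDone;
    }
}

class SlotAssigner {
    // Hand one free task slot to the first job, in priority order, that still
    // has pending work and is not already running at its task limit; a capped
    // job is simply skipped so the next job in line can use the slot.
    static SchedulableJob pickJobForFreeSlot(List<SchedulableJob> jobs) {
        for (SchedulableJob job : jobs) {
            if (job.pendingTasks > 0 && !job.atTaskLimit()) {
                return job;
            }
        }
        return null;  // no runnable work anywhere
    }
}

With a rule like this, a bandwidth-constrained job such as the fetcher could
set a low cap and leave its unused slots to other jobs, while a reduce-start
threshold would only need to be applied when reduce slots are contended.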