From: "Paul Sutter" <sutter@gmail.com>
To: hadoop-user@lucene.apache.org
Date: Tue, 25 Jul 2006 15:52:34 -0700
Subject: Re: Task type priorities during scheduling ?

But the first reducers don't enter the reduce phase until _all_ of the
mappers have finished (they need one file from each mapper, even the
last one to run). During this time (which could be many hours), the
reduce task sits in the copy phase, waiting around, taking up that task
slot and the resources that go with it.

The suggestion is this: only charge against the reduce task limit during
the reduce phase, not during the copy (shuffle) phase. That would close
the window. (A rough sketch of what this accounting change could look
like appears at the end of this message.)

Incidentally, we run two parallel sets of TaskTrackers on the same
cluster, one that is niced and one that is not (we call them yellow and
blue, named after the Ganglia CPU colors). These run against a single
DFS instance, giving us a fantastic foreground/background capability.
Kevin will be posting a couple of minor patches that make this possible.
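For anyone who wants to try this before the patches land, here is
roughly what the setup amounts to. This is a sketch, not Kevin's actual
patches: the conf directory names and the master:9001/9002 addresses are
placeholders, it assumes each TaskTracker set reports to its own
JobTracker, and it relies on the stock bin/hadoop-daemon.sh script and
its --config flag. The key point is that both grids carry the same
fs.default.name (one shared DFS) but different mapred.job.tracker
settings:

  # conf-blue/hadoop-site.xml   : mapred.job.tracker = master:9001
  # conf-yellow/hadoop-site.xml : mapred.job.tracker = master:9002
  # Both files set the same fs.default.name, so both grids share one DFS.

  # Foreground ("blue") grid at normal priority:
  bin/hadoop-daemon.sh --config conf-blue start jobtracker
  bin/hadoop-daemon.sh --config conf-blue start tasktracker

  # Background ("yellow") grid under nice; any task processes it spawns
  # inherit the niceness, so blue work wins the CPU when both are busy:
  nice -n 19 bin/hadoop-daemon.sh --config conf-yellow start jobtracker
  nice -n 19 bin/hadoop-daemon.sh --config conf-yellow start tasktracker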
On 7/25/06, Yoram Arnon wrote:
> You're looking for more flexibility - to use those idle reduce slots
> to get some useful map tasks executed from the next job. You could
> alternatively schedule more map tasks from the current job to get it
> done faster, and scale back once reduces start kicking in. That makes
> more sense than reserving slots for some future job, or forcing
> multiplexing of jobs in a single map-reduce cluster.
>
> So rather than having a hard limit of tasks per type per node, allow
> nodes to run additional tasks while the reduces are just fetching
> data, and once the data has arrived have the reduces wait for those
> extra tasks to complete before starting their processing.
>
> It would add a bit of complexity to task scheduling, but it would
> speed up jobs.
>
> If you routinely interleave small jobs with large jobs, you may
> consider setting aside a small subset of your cluster as a separate
> map-reduce cluster, using a common DFS. Then you'd run your large jobs
> on the large cluster and the small jobs on the small cluster, sharing
> data between them.
>
> Yoram
>
> -----Original Message-----
> From: Paul Sutter [mailto:sutter@gmail.com]
> Sent: Tuesday, July 25, 2006 2:55 PM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Task type priorities during scheduling ?
>
> Perfect. Thanks, Yoram.
>
> And here's the situation; help me out if I have this wrong.
>
> Let's say the 20-hour job has 10 hours of mapping to undergo. That's
> 10 hours when its first reduce tasks are filling the available reduce
> slots, doing a little bit of copying and a whole lot of nothing at
> all.
>
> Meanwhile, I would want to run a little 20-minute job, whose reduce
> tasks would have to wait 10 hours for the first reducers of the big
> job to complete (likewise all those resources set aside for reduce
> tasks, such as the sorter RAM, sit idle, because the copy phase
> certainly doesn't need them).
>
> Do I have that right, as it stands now?
>
> Paul
>
> On 7/25/06, Yoram Arnon wrote:
> > There is, actually, support for multiple jobs. Maps are scheduled
> > separately from reduces, and when the current job cannot saturate
> > the cluster then the next job's tasks get scheduled, and the next.
> > I've seen several small jobs execute concurrently on my largish
> > clusters. Reduces for a given job won't get scheduled before maps of
> > that job are scheduled, but that makes perfect sense - they'll have
> > no work to do. Once map tasks start getting scheduled, though, if
> > there are available reduce slots, they'll get assigned reduce tasks.
> >
> > Yoram
> >
> > -----Original Message-----
> > From: Paul Sutter [mailto:sutter@gmail.com]
> > Sent: Tuesday, July 25, 2006 11:01 AM
> > To: hadoop-user@lucene.apache.org
> > Subject: Re: Task type priorities during scheduling ?
> >
> > First, it matters in the case of concurrent jobs. If you submit a
> > 20-minute job while a 20-hour job is running, it would be nice if
> > the reducers for the 20-minute job could get a chance to run before
> > the 20-hour job's mappers have all finished. So even without a
> > throughput improvement, you have an important capability (although
> > it may require another minor tweak or two to make possible).
> >
> > Second, we often have stragglers, where one mapper runs slower than
> > the others. When this happens, we end up with a largely idle cluster
> > for as long as an hour. In cases like these, good support for
> > concurrent jobs _would_ improve throughput.
> >
> > Paul
> >
> > On 7/25/06, Doug Cutting wrote:
> > > Paul Sutter wrote:
> > > > it should be possible to have lots of tasks in the shuffle phase
> > > > (mostly sitting around waiting for mappers to run), but only
> > > > have "about" one actual reduce phase running per CPU (or
> > > > whatever works for each of our apps) that gets enough memory for
> > > > a sorter, does substantial computation, etc.
> > >
> > > Ah, now I see your point, although I don't see how this would
> > > improve overall throughput. In most cases, the optimal
> > > configuration is for the total number of reduce tasks to be
> > > roughly the total number of reduces that can run at once. So there
> > > is no queue of waiting reduce tasks to schedule.
> > >
> > > Doug
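P.S. Here is the kind of accounting change I mean (it would also pair
naturally with Yoram's idea of letting extra tasks run while reduces are
only fetching). This is hypothetical code, not the actual JobTracker
source -- every name in it (SlotAccounting, ReduceAttempt, Phase,
canAssignReduce, the per-node limit) is invented for illustration:

  import java.util.List;

  class SlotAccounting {
      // The stages a reduce attempt moves through.
      enum Phase { SHUFFLE, SORT, REDUCE }

      static class ReduceAttempt {
          final Phase phase;
          ReduceAttempt(Phase phase) { this.phase = phase; }
      }

      // Charge only reduces that have left the copy (shuffle) phase
      // against the per-node reduce limit, so a node keeps accepting
      // work while its reduces are merely fetching map output.
      static boolean canAssignReduce(List<ReduceAttempt> running,
                                     int maxReducesPerNode) {
          int charged = 0;
          for (ReduceAttempt r : running) {
              if (r.phase != Phase.SHUFFLE) {
                  charged++;
              }
          }
          return charged < maxReducesPerNode;
      }
  }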