From: Jeff Zhang
To: mapreduce-dev@hadoop.apache.org
Date: Thu, 4 Feb 2010 17:55:54 +0800
Subject: Re: The idea to enhance MapReduce to resolve the skew problem
Message-ID: <8211a1321002040155m55bca611ha6f9e2561e61f390@mail.gmail.com>
In-Reply-To: <5a8ac5821002040010m73f256a1g7cb8a6c5a3f95d76@mail.gmail.com>

Hi,

Do you mean resplitting and recombining within each mapper task? I am not
sure what the purpose would be. As I understand it, the Partitioner
determines which reducer each record of a mapper task's output goes to, so
I don't think your method can solve the skew problem.

2010/2/4 易剑

> Currently, only map tasks are balanced; reduce tasks may be skewed, and
> their timeslices also differ, which prevents the scheduler from being
> smart. I have an idea to improve this.
>
> We can break the map output into N*M splits, where N is the number of
> nodes and M >= 1, and regroup them into new splits by combining the
> smaller splits and resplitting the bigger ones, until the size of every
> split is balanced around a specified value.
>
> There are three cases:
> 1. Too many values for a key
> 2. Too many keys hash to a partition
> 3. Every partition is balanced in size
>
> If there are too many values for a key, adding a new MapReduce procedure
> is necessary.
> If too many keys hash to a partition, resplitting is necessary.
>
> If every split is balanced, we can treat a task (map or reduce) as a
> scheduler timeslice, and the scheduler will be smart like an OS scheduler.
>

-- 
Best Regards

Jeff Zhang
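
P.S. To make the disagreement concrete, here is a minimal sketch in plain
Java (no Hadoop dependency). It assumes the hash arithmetic used by Hadoop's
default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks.
The class SkewSketch and the methods partitionFor, recordsPerReducer and
regroup are hypothetical names made up for illustration, and regroup is only
one possible reading of "combine the smaller splits and resplit the bigger
ones", not the proposer's actual algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: SkewSketch and its methods are hypothetical
// names, not part of Hadoop.
class SkewSketch {

    // Same arithmetic as Hadoop's default HashPartitioner: the target
    // reducer depends only on the key and the number of reducers.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Count how many records each reducer would receive.
    static long[] recordsPerReducer(List<String> keys, int numReducers) {
        long[] counts = new long[numReducers];
        for (String key : keys) {
            counts[partitionFor(key, numReducers)]++;
        }
        return counts;
    }

    // Regroup split sizes toward a target: cut splits larger than the
    // target into target-sized pieces, then greedily merge the leftovers.
    // Assumption: a split can be cut at any point. That does not hold when
    // one key owns most of the records (case 1 in the quoted proposal).
    static List<Long> regroup(List<Long> splitSizes, long target) {
        List<Long> small = new ArrayList<>();
        List<Long> result = new ArrayList<>();
        for (long size : splitSizes) {
            while (size > target) {
                result.add(target);
                size -= target;
            }
            if (size > 0) {
                small.add(size);
            }
        }
        Collections.sort(small);
        long current = 0;
        for (long size : small) {
            // Close the current group if adding this piece would overflow it.
            if (current > 0 && current + size > target) {
                result.add(current);
                current = 0;
            }
            current += size;
        }
        if (current > 0) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) keys.add("hot");      // one skewed key
        for (int i = 0; i < 100; i++) keys.add("key" + i);   // many light keys
        long[] counts = recordsPerReducer(keys, 4);
        for (int r = 0; r < counts.length; r++) {
            System.out.println("reducer " + r + ": " + counts[r] + " records");
        }
        System.out.println(regroup(Arrays.asList(5L, 120L, 30L, 10L), 50L));
    }
}
```

All 1000 "hot" records land on a single reducer however the map output is
split beforehand, which is the concern raised above. Rebalancing by size, as
in regroup, only helps in case 2 (many keys hashing to one partition), where
a large split can be cut between keys.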