From: Todd Lipcon
Date: Wed, 18 Nov 2009 14:55:04 -0800
Subject: Re: How to handle imbalanced data in hadoop ?
To: common-user@hadoop.apache.org

Hi Pankil,

Thanks for sending these along. I'll try to block out some time this week
to take a look.

-Todd

On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi wrote:

> Hey Todd,
>
> I will attach the dataset and the Java source I used. Make sure you run
> with 10 reducers and also use the partitioner class I have provided.
>
> Dataset-1 has the smaller key length
> Dataset-2 has the larger key length
>
> When I experiment with both datasets, according to my partitioner class,
> Reducer 9 (i.e. reducer 10 if you count from 1) gets 100,000 keys that are
> all the same, so it takes the longest of all the reducers (about 17
> minutes), whereas each of the remaining reducers also gets 100,000 keys,
> but not all identical, and those reducers finish in about 1 min 30 sec on
> average.
>
> Pankil
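As a rough illustration of the behaviour Pankil describes (his actual
partitioner class was sent as an attachment and is not reproduced here), a
minimal, hypothetical hash-style partitioner looks like the sketch below:
every occurrence of a given key is routed to the same reducer, so a key
repeated 100,000 times turns one reducer into a straggler.

    // Hypothetical sketch, not the attached class: a hash-style partitioner
    // in which identical keys always land on the same reducer.
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashLikePartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Identical keys produce identical hash codes, hence the same
            // partition; with 10 reducers, one repeated key loads one reducer.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }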
>
> On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon wrote:
>
>> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi wrote:
>>
>> > With respect to imbalanced data, can anyone guide me on how sorting
>> > takes place in Hadoop after the map phase?
>> >
>> > I did some experiments and found that if two reducers have the same
>> > number of keys to sort, and one reducer's keys are all identical while
>> > the other's are all different, then the reducer with the identical keys
>> > takes far longer than the other one.
>>
>> Hi Pankil,
>>
>> This is an interesting experiment you've done, with results that I
>> wouldn't quite expect. Do you have the Java source available that you
>> used to run this experiment?
>>
>> > Also, I found that the length of my key doesn't affect the time taken
>> > to sort it.
>>
>> With small keys on a CPU-bound workload this is probably the case, since
>> the sort would be dominated by comparisons. If you were to benchmark keys
>> that are 10 bytes vs. keys that are 1000 bytes, I'm sure you'd see a
>> difference.
>>
>> > I wanted some hints on how the sorting is done.
>>
>> MapTask.java, ReduceTask.java, and Merger.java are the key places to
>> look. The actual sort is a relatively basic quicksort, but there is
>> plenty of complexity in the spill/shuffle/merge logic.
>>
>> -Todd
>>
>> > Pankil
>> >
>> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher wrote:
>> >
>> > > Hey Jeff,
>> > >
>> > > You may be interested in the Skewed Join design specification from
>> > > the Pig team: http://wiki.apache.org/pig/PigSkewedJoinSpec.
>> > >
>> > > Regards,
>> > > Jeff
>> > >
>> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell wrote:
>> > >
>> > > > My first thought is that it depends on the reduce logic. If you
>> > > > could do the reduction in two passes, then you could do an initial
>> > > > arbitrary partition for the majority key and bring the partitions
>> > > > together in a second reduction (or a map-side join). I would use a
>> > > > round-robin strategy to assign the arbitrary partitions.
>> > > >
>> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > Today a problem about imbalanced data came to mind.
>> > > > >
>> > > > > I'd like to know how Hadoop handles this kind of data, e.g. when
>> > > > > one key dominates the map output, say 99%. Then 99% of the data
>> > > > > set goes to one reducer, and that reducer becomes the bottleneck.
>> > > > >
>> > > > > Does Hadoop have any better way to handle such an imbalanced data
>> > > > > set?
>> > > > >
>> > > > > Jeff Zhang
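The round-robin, two-pass approach brien colwell describes above can be
sketched as follows (hypothetical class name, key, and tab-separated record
layout; this is not code from this thread). In the first pass the mapper
appends a rotating salt to the dominant key, so its values are spread evenly
across the reducers and each reducer emits a partial aggregate; a second job
(not shown) strips the salt and combines the partial aggregates. This only
helps when the reduce logic can be applied in two passes (e.g. counts or
sums); for joins, a map-side join or Pig's skewed join (linked above) is the
usual alternative.

    // Hypothetical sketch of the two-pass / round-robin idea: pass 1 salts
    // the hot key so it no longer maps to a single reducer.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final String HOT_KEY = "the-dominant-key"; // assumed known in advance
        private static final int NUM_SALTS = 10;                  // e.g. one salt per reducer
        private int nextSalt = 0;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: key <TAB> value
            String[] fields = line.toString().split("\t", 2);
            String key = fields[0];
            String value = fields.length > 1 ? fields[1] : "";
            if (HOT_KEY.equals(key)) {
                // Round-robin the hot key over NUM_SALTS synthetic keys so its
                // values are partitioned across all reducers.
                key = key + "#" + nextSalt;
                nextSalt = (nextSalt + 1) % NUM_SALTS;
            }
            context.write(new Text(key), new Text(value));
        }
    }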