From: Todd Lipcon
Date: Wed, 18 Nov 2009 14:55:04 -0800
Subject: Re: How to handle imbalanced data in hadoop ?
To: common-user@hadoop.apache.org

Hi Pankil,

Thanks for sending these along. I'll try to block out some time this week
to take a look.

-Todd

On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi wrote:

> Hey Todd,
>
> I will attach the dataset and the Java source I used. Make sure you run
> with 10 reducers and also use the partitioner class I have provided.
>
> Dataset-1 has the smaller key length
> Dataset-2 has the larger key length
>
> When I experiment with both datasets, according to my partitioner class,
> Reducer 9 (i.e. reducer 10 if you count from 1) gets 100,000 keys that are
> all the same, so it takes the longest of all the reducers (about 17
> minutes), whereas each of the remaining reducers also gets 100,000 keys,
> but not all identical, and those reducers finish in about 1 min 30 sec on
> average.
>
> Pankil
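As a rough illustration of the behaviour Pankil describes (his actual
partitioner class was sent as an attachment and is not reproduced here), a
minimal, hypothetical hash-style partitioner looks like the sketch below:
every occurrence of a given key is routed to the same reducer, so a key
repeated 100,000 times turns one reducer into a straggler.

    // Hypothetical sketch, not the attached class: a hash-style partitioner
    // in which identical keys always land on the same reducer.
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashLikePartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Identical keys produce identical hash codes, hence the same
            // partition; with 10 reducers, one repeated key loads one reducer.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }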
>
> On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon wrote:
>
>> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi wrote:
>>
>> > With respect to imbalanced data, can anyone guide me on how sorting
>> > takes place in Hadoop after the map phase?
>> >
>> > I did some experiments and found that if two reducers have the same
>> > number of keys to sort, and one reducer's keys are all identical while
>> > the other's are all different, then the reducer with the identical keys
>> > takes far longer than the other one.
>>
>> Hi Pankil,
>>
>> This is an interesting experiment you've done, with results that I
>> wouldn't quite expect. Do you have the Java source available that you
>> used to run this experiment?
>>
>> > Also, I found that the length of my key doesn't affect the time taken
>> > to sort it.
>>
>> With small keys on a CPU-bound workload this is probably the case, since
>> the sort would be dominated by comparisons. If you were to benchmark keys
>> that are 10 bytes vs. keys that are 1000 bytes, I'm sure you'd see a
>> difference.
>>
>> > I wanted some hints on how the sorting is done.
>>
>> MapTask.java, ReduceTask.java, and Merger.java are the key places to
>> look. The actual sort is a relatively basic quicksort, but there is
>> plenty of complexity in the spill/shuffle/merge logic.
>>
>> -Todd
>>
>> > Pankil
>> >
>> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher wrote:
>> >
>> > > Hey Jeff,
>> > >
>> > > You may be interested in the Skewed Join design specification from
>> > > the Pig team: http://wiki.apache.org/pig/PigSkewedJoinSpec.
>> > >
>> > > Regards,
>> > > Jeff
>> > >
>> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell wrote:
>> > >
>> > > > My first thought is that it depends on the reduce logic. If you
>> > > > could do the reduction in two passes, then you could do an initial
>> > > > arbitrary partition for the majority key and bring the partitions
>> > > > together in a second reduction (or a map-side join). I would use a
>> > > > round-robin strategy to assign the arbitrary partitions.
>> > > >
>> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > Today a problem about imbalanced data came to mind.
>> > > > >
>> > > > > I'd like to know how Hadoop handles this kind of data, e.g. when
>> > > > > one key dominates the map output, say 99%. Then 99% of the data
>> > > > > set goes to one reducer, and that reducer becomes the bottleneck.
>> > > > >
>> > > > > Does Hadoop have any better way to handle such an imbalanced data
>> > > > > set?
>> > > > >
>> > > > > Jeff Zhang
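The round-robin, two-pass approach brien colwell describes above can be
sketched as follows (hypothetical class name, key, and tab-separated record
layout; this is not code from this thread). In the first pass the mapper
appends a rotating salt to the dominant key, so its values are spread evenly
across the reducers and each reducer emits a partial aggregate; a second job
(not shown) strips the salt and combines the partial aggregates. This only
helps when the reduce logic can be applied in two passes (e.g. counts or
sums); for joins, a map-side join or Pig's skewed join (linked above) is the
usual alternative.

    // Hypothetical sketch of the two-pass / round-robin idea: pass 1 salts
    // the hot key so it no longer maps to a single reducer.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final String HOT_KEY = "the-dominant-key"; // assumed known in advance
        private static final int NUM_SALTS = 10;                  // e.g. one salt per reducer
        private int nextSalt = 0;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: key <TAB> value
            String[] fields = line.toString().split("\t", 2);
            String key = fields[0];
            String value = fields.length > 1 ? fields[1] : "";
            if (HOT_KEY.equals(key)) {
                // Round-robin the hot key over NUM_SALTS synthetic keys so its
                // values are partitioned across all reducers.
                key = key + "#" + nextSalt;
                nextSalt = (nextSalt + 1) % NUM_SALTS;
            }
            context.write(new Text(key), new Text(value));
        }
    }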