hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pankil Doshi <forpan...@gmail.com>
Subject Re: How to handle imbalanced data in hadoop ?
Date Tue, 24 Nov 2009 18:35:11 GMT
Thanks a lot Todd.

Its very good suggestion. Will change my reducer code and see how does it
goes.

Pankil


On Tue, Nov 24, 2009 at 11:18 AM, Todd Lipcon <todd@cloudera.com> wrote:

> I don't disagree, but as I noted above, the majority of the issue Pankil
> saw
> was an inefficient reducer that was extremely slow for large value sets for
> a given key. When I fixed that issue, the sort time was within the noise
> margin between the "all the same" and "9 different values" reducers.
>
> -Todd
>
> On Tue, Nov 24, 2009 at 8:09 AM, Jeff Zhang <zjffdu@gmail.com> wrote:
>
> > agree, the worst case for quick sort is O(n2) if all the data is the
> same.
> >
> > and in this case the running for heapsort is O(n*log n)
> >
> > Jeff Zhang
> >
> >
> > On Tue, Nov 24, 2009 at 7:35 AM, Ted Xu <ted.xu.ml@gmail.com> wrote:
> >
> > > 2009/11/18 Pankil Doshi <forpankil@gmail.com>
> > >
> > > > With respect to Imbalanced data, Can anyone guide me how sorting
> takes
> > > > place
> > > > in Hadoop after Map phase.
> > > >
> > > > I did some experiments and found that if there are two reducers which
> > > have
> > > > same number of keys to sort and one reducer has all the keys same and
> > > other
> > > > have different keys then time taken by by the reducer having all keys
> > > same
> > > > is terribly large then other one.
> > > >
> > >
> > > Maybe that is caused by the sorting algorithm.
> > >
> > > It is make sence if we consider the default sort algorithm is quick
> sort.
> > > In quick sort if
> > > all the sorting data is quite the same, sort iteration will be very
> deep.
> > >
> > > Would you please try changing the job configuration "map.sort.class" to
> > be
> > > org.apache.hadoop.util.HeapSort?
> > >
> > >
> > > > Also I found that length on my Key doesnt matter in the time taken to
> > > sort
> > > > it.
> > > >
> > > > I wanted some hints how sorting is done ..
> > > >
> > > > Pankil
> > > >
> > > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <
> > hammer@cloudera.com
> > > > >wrote:
> > > >
> > > > > Hey Jeff,
> > > > >
> > > > > You may be interested in the Skewed Design specification from the
> Pig
> > > > team:
> > > > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> > > > >
> > > > > Regards,
> > > > > Jeff
> > > > >
> > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolwell@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > My first thought is that it depends on the reduce logic. If
you
> > could
> > > > do
> > > > > > the
> > > > > > reduction in two passes then you could do an initial arbitrary
> > > > partition
> > > > > > for
> > > > > > the majority key and bring the partitions together in a second
> > > > reduction
> > > > > > (or
> > > > > > a map-side join). I would use a round robin strategy to assign
> the
> > > > > > arbitrary
> > > > > > partitions.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjffdu@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Today there's a problem about imbalanced data come out
of mind
> .
> > > > > > >
> > > > > > > I'd like to know how hadoop handle this kind of data. 
e.g. one
> > key
> > > > > > > dominates the map output, say 99%. So 99% data set will
go to
> one
> > > > > > reducer,
> > > > > > > and this reducer will become the bottleneck.
> > > > > > >
> > > > > > > Does hadoop have any other better ways to handle such
> imbalanced
> > > data
> > > > > set
> > > > > > ?
> > > > > > >
> > > > > > >
> > > > > > > Jeff Zhang
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message