hadoop-common-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: How to handle imbalanced data in hadoop ?
Date Wed, 18 Nov 2009 00:05:57 GMT
Can someone fix the typo on http://wiki.apache.org/pig/PigSkewedJoinSpec in
the first bullet? It currently reads "tow-table inner join" (should be
"two-table").

Thanks

On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <forpankil@gmail.com> wrote:

> With respect to imbalanced data, can anyone explain how sorting takes place
> in Hadoop after the map phase?
>
> I did some experiments with two reducers that each have the same number of
> keys to sort. When one reducer's keys are all identical and the other's are
> all different, the reducer with identical keys takes far longer than the
> other one.
>
> I also found that the length of my key doesn't affect the time taken to
> sort it.
>
> I would like some hints on how the sorting is done.
>
> Pankil
>
> On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <hammer@cloudera.com>
> wrote:
>
> > Hey Jeff,
> >
> > You may be interested in the Skewed Design specification from the Pig
> > team: http://wiki.apache.org/pig/PigSkewedJoinSpec.
> >
> > Regards,
> > Jeff
> >
> > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolwell@gmail.com>
> > wrote:
> >
> > > My first thought is that it depends on the reduce logic. If you could do
> > > the reduction in two passes then you could do an initial arbitrary
> > > partition for the majority key and bring the partitions together in a
> > > second reduction (or a map-side join). I would use a round robin strategy
> > > to assign the arbitrary partitions.
> > >
> > >
> > >
> > >
> > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjffdu@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > A question about imbalanced data came to mind today.
> > > >
> > > > I'd like to know how Hadoop handles this kind of data, e.g. when one key
> > > > dominates the map output, say 99%. Then 99% of the data set goes to one
> > > > reducer, and that reducer becomes the bottleneck.
> > > >
> > > > Does Hadoop have any better way to handle such an imbalanced data set?
> > > >
> > > >
> > > > Jeff Zhang
> > > >
> > >
> >
>
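
For reference, the two-pass idea suggested earlier in the thread (spread the
dominant key over arbitrary partitions round robin, then bring the partial
results together in a second reduction) can be sketched roughly as below. This
is a minimal in-memory simulation, not Hadoop code: NUM_REDUCERS, HOT_KEYS,
partition(), first_pass() and second_pass() are hypothetical names chosen for
illustration, the hot key is assumed to be known in advance, and the reduce
logic is assumed to be a simple sum, which can safely be applied in two stages.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 4      # hypothetical number of reduce tasks
HOT_KEYS = {"hot"}    # assumption: dominant keys are known up front (e.g. by sampling)


def partition(key: str) -> int:
    """Stable stand-in for an ordinary hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def first_pass(records):
    """Pass 1: send hot keys round robin to all reducers, sum per reducer."""
    next_slot = 0
    partials = defaultdict(Counter)  # reducer id -> {key: partial sum}
    for key, value in records:
        if key in HOT_KEYS:
            reducer = next_slot
            next_slot = (next_slot + 1) % NUM_REDUCERS
        else:
            reducer = partition(key)
        partials[reducer][key] += value
    return partials


def second_pass(partials):
    """Pass 2: merge the per-reducer partial sums into one total per key."""
    totals = Counter()
    for per_reducer in partials.values():
        totals.update(per_reducer)  # Counter.update adds the counts
    return totals


# 99% of the records share one key, as in the original question.
records = [("hot", 1)] * 99 + [("cold", 1)]
partials = first_pass(records)
load = {r: sum(c.values()) for r, c in sorted(partials.items())}
totals = second_pass(partials)
print("records per reducer in pass 1:", load)
print("final totals:", dict(totals))
```

The second pass only sees one partial result per reducer and key, so it stays
small even when the input is heavily skewed. As noted in the thread, whether
this works depends on the reduce logic: it is fine for sums and similar
operations that can be combined from partial results, but not for a reduce
that needs to see every value of a key at once.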
