hadoop-common-user mailing list archives

From Richa Khandelwal <richa...@gmail.com>
Subject Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value
Date Thu, 12 Mar 2009 13:32:10 GMT
I am running the same test: the job that completes in 10 minutes for the
(hk,lv) case is still running after 30 minutes have passed for the (lk,hv)
case. It would be interesting to pinpoint the reason behind it.
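
For reference, the two emit strategies I am timing look roughly like this
(a sketch against the old mapred API; the Text/IntWritable types and the
way the light key is derived are my own setup, not necessarily Gyanit's):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // Variant A: heavy key, light value -- the ~1 KB payload rides in the key.
  class HeavyKeyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable offset, Text record,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      out.collect(record, ONE);                      // payload in the key
    }
  }

  // Variant B: light key, heavy value -- same payload, now in the value.
  class HeavyValueMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable offset, Text record,
                    OutputCollector<IntWritable, Text> out, Reporter reporter)
        throws IOException {
      int k = (record.hashCode() & Integer.MAX_VALUE) % 1000;  // light int key
      out.collect(new IntWritable(k), record);       // payload in the value
    }
  }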
On Wed, Mar 11, 2009 at 1:27 PM, Gyanit <gyanit@gmail.com> wrote:

>
> Here are the exact numbers:
> # of (k,v) pairs = 1.2 million (the same in both cases).
> # of unique k = 1,000; k is an integer.
> # of unique v = 1 million; v is a very large string.
> For a given k, the cumulative size of all v's associated with it is about
> 30 MB (that is, each v is about 25-30 KB).
> # of Mappers = 30
> # of Reducers = 10
>
> (v,k) is at least 4-5 times faster than (k,v).
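>
> A back-of-envelope check, assuming the default hash partitioner spreads
> the 1,000 integer keys roughly evenly across the reducers:
>
>   1,000 keys x ~30 MB of values per key = ~30 GB shuffled either way;
>   10 reducers -> ~100 keys and ~3 GB per reducer if perfectly even.
>
> With only 1,000 distinct keys, any hash skew moves whole ~30 MB buckets
> between reducers; the (v,k) layout has ~1 million distinct keys to
> spread, so the load evens out much better.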
>
> -Gyanit
>
>
> Scott Carey wrote:
> >
> > Well, if the smaller keys produce fewer unique key values, there should
> > be some more significant differences.
> >
> > I had assumed that your test produced the same number of unique key
> > values.
> >
> > I'm still not sure why there would be such a significant difference, as
> > long as the number of unique keys in the small-key test is a good deal
> > larger than the number of reducers and there is not too much skew in the
> > bucket sizes. But if a small subset of keys in the small-key test holds
> > a large subset of the values, then the reducers will have very skewed
> > work sizes, and that could explain your observation.
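> >
> > One quick way to check that: replay your actual keys through the formula
> > Hadoop's default HashPartitioner uses and count bytes per reducer. A
> > standalone sketch (the sequential keys and the 30 MB-per-key figure are
> > placeholders for your real data):
> >
> >   // Default partition assignment in Hadoop:
> >   //   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
> >   int numReducers = 10;
> >   long[] bytesPerReducer = new long[numReducers];
> >   for (int key = 0; key < 1000; key++) {    // substitute the real 1,000 keys
> >     int p = (Integer.valueOf(key).hashCode() & Integer.MAX_VALUE) % numReducers;
> >     bytesPerReducer[p] += 30L << 20;        // ~30 MB of values per key
> >   }
> >   for (int p = 0; p < numReducers; p++)
> >     System.out.println("reducer " + p + ": " + (bytesPerReducer[p] >> 20) + " MB");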
> >
> >
> > On 3/11/09 11:50 AM, "Gyanit" <gyanit@gmail.com> wrote:
> >
> >
> >
> > I noticed one more thing: lighter keys tend to produce a smaller number
> > of unique keys.
> > For example, there may be 10 million (key,value) pairs, but if the key
> > is light there might be just 1,000 unique keys.
> > In the other case, with heavier keys, there might be 5 million unique
> > keys.
> > I think this might have something to do with it.
> > Bottom line: if your reduce is a simple dump with no combining, then put
> > the data in keys rather than values. (One way to do that safely is
> > sketched below.)
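> >
> > (The usual way to move the payload into the key without losing the
> > grouping is a composite key: sort on the full key, but partition and
> > group on the light part only. A sketch against the old JobConf API --
> > the class names here are made up for illustration:)
> >
> >   // Composite key = (light grouping key, heavy payload).
> >   JobConf conf = new JobConf(MyJob.class);
> >   conf.setOutputKeyComparatorClass(FullKeyComparator.class);   // sort: light part, then payload
> >   conf.setPartitionerClass(LightPartPartitioner.class);        // hash only the light part
> >   conf.setOutputValueGroupingComparator(LightPartComparator.class); // one reduce() per light key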
> >
> > I need to put the data in values. Any suggestions on how to make that
> > case faster?
> >
> > -Gyanit.
> >
> >
> > Scott Carey wrote:
> >>
> >> That is a fascinating question.  I would also love to know the reason
> >> behind this.
> >>
> >> If I were to guess, I would have thought that smaller keys and heavier
> >> values would slightly outperform, rather than significantly underperform
> >> (assuming the total pair count at each phase is the same). Perhaps there
> >> is room for optimization here?
> >>
> >>
> >>
> >> On 3/10/09 6:44 PM, "Gyanit" <gyanit@gmail.com> wrote:
> >>
> >>
> >>
> >> I have a large number of (key, value) pairs, and I don't actually care
> >> whether the data goes in the value or the key. Let me be more exact:
> >> the number of (k,v) pairs after the combiner is about 1 million, with
> >> approximately 1 KB of data per pair, which I can put in either the
> >> keys or the values.
> >> I have experimented with both options, (heavy key, light value) vs
> >> (light key, heavy value). It turns out that the (hk,lv) option is much,
> >> much better than (lk,hv).
> >> Has someone else also noticed this?
> >> Is there a way to make things faster in the (light key, heavy value)
> >> option? Some applications will need that as well.
> >> Remember, in both cases we are talking about at least a dozen or so
> >> million pairs.
> >> There is a difference in the time of the shuffle phase, which is odd,
> >> as the amount of data transferred is the same.
> >>
> >> -gyanit
> >
>


-- 
Richa Khandelwal


University Of California,
Santa Cruz.
Ph:425-241-7763
