My suggestion is to use a secondary sort with a single reducer. That way you
can easily extract the top N. If you want the top N%, you'll need an
additional phase to determine how many records that N% actually is.
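
To make the two phases concrete, here is a rough sketch in plain Python (not
Hadoop API code). It assumes the records are already sorted in descending
order by the time they reach the single reducer, and that the percentage is a
whole number. The function names and sample values are made up for
illustration only.

```python
def count_records(records):
    """Phase 1: count all records (in MapReduce, a small counting job)."""
    return sum(1 for _ in records)


def split_top_percent(sorted_records, total, percent):
    """Phase 2: one pass over records sorted in descending order.

    Records ranked within the top `percent` go to `top` (logic 1),
    everything else goes to `rest` (logic 2).
    """
    # Integer ceiling of total * percent / 100, avoiding float rounding.
    cutoff = (total * percent + 99) // 100
    top, rest = [], []
    for rank, record in enumerate(sorted_records, start=1):
        if rank <= cutoff:
            top.append((rank, record))
        else:
            rest.append((rank, record))
    return top, rest


# Hypothetical sample data standing in for the real input.
values = [17, 3, 42, 8, 25, 11, 30, 5, 19, 2]
total = count_records(values)
top, rest = split_top_percent(sorted(values, reverse=True), total, 20)
```

The point is only that the cutoff rank has to be known before the ranked pass
starts, which is why the count needs its own phase.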

> My actual problem is to rank all values and then apply logic 1 to the top n%
> of values and logic 2 to the remaining values.
> 1st: Ranking? (need major suggestions here)
> 2nd: Find the top n% of them.
> Then the rest is covered.
On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang <lakechang@gmail.com> wrote:
> > There's one thing I want to clarify: you can use multiple reducers to sort
> > the data globally and then cat all the parts to get the top n records. The
> > data in all parts is globally in order.
> > Then you may find the problem is much easier.
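
A rough illustration of why the concatenated parts come out globally ordered:
each reducer gets a contiguous key range, so sorting within each part is
enough. In Hadoop this is, as far as I know, what TotalOrderPartitioner
together with InputSampler is for. The plain-Python sketch below only
simulates the idea; the helper names, the sampling choice, and the data are
assumptions for illustration.

```python
import bisect


def pick_split_points(sample, num_partitions):
    """Choose num_partitions - 1 boundary keys from a sample of the input."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition(values, split_points):
    """Send each value to the partition whose key range contains it."""
    parts = [[] for _ in range(len(split_points) + 1)]
    for v in values:
        parts[bisect.bisect_right(split_points, v)].append(v)
    return parts


# Hypothetical sample data; every second value serves as the sample.
values = [17, 3, 42, 8, 25, 11, 30, 5, 19, 2]
split_points = pick_split_points(values[::2], 3)

# Each "reducer" sorts only its own part.
parts = [sorted(p) for p in range_partition(values, split_points)]

# Concatenating the parts in partition order gives a global ascending order.
merged = [v for p in parts for v in p]
top3 = merged[::-1][:3]
```

How evenly the parts come out depends on how well the sample reflects the real
key distribution.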
On Feb 2, 2013, at 3:18 PM, "praveenesh kumar" <praveenesh@gmail.com> wrote:
> >
> >> Actually, what I am trying to find is the top n% of the whole data.
> >> This n could be very large if my data is large.
> >> Assuming I have uniform rows of equal size and the total data size is
> >> 10 GB, then with the approach mentioned above, taking the top 10% of the
> >> whole data set means I need roughly 1 GB worth of rows in my mappers.
> >> I think that would not be possible given that my input splits are
> >> 64/128/512 MB (based on my block size), or am I making wrong
> >> assumptions? I can increase the input split size, but is there a better
> >> way to find the top n%?
> >> My whole actual problem is to give ranks to some values and then find
> >> out the top 10 ranks.
> >>
> >> I think this context gives a better idea of the problem.
> >>
> >> > Hi,
> >> >
> >> > Can you tell us more about:
> >> > * How big is N?
> >> > * How big is the input dataset?
> >> > * How many mappers do you have?
> >> > * Do input splits correlate with the sorting criterion for top N?
> >> >
> >> > Depending on the answers, very different strategies will be optimal.
> >> >
> >> >> I am looking for a better solution for this.
> >> >>
> >> >> One way to do this would be to find the top N values in each mapper and
> >> >> then find the top N among those in one reducer. I am afraid that this
> >> >> won't work effectively if my N is larger than the number of values in
> >> >> my input split (or mapper input).
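
For reference, the per-mapper approach described above can be sketched in
plain Python as follows (not Hadoop code; function names and data are invented
for illustration). Note that when N exceeds the number of values in a split,
the mapper simply emits everything it has: the result stays correct, but the
mappers no longer reduce the volume sent to the single reducer.

```python
import heapq


def mapper_top_n(split, n):
    """Each mapper keeps at most its n largest values."""
    return heapq.nlargest(n, split)


def reducer_top_n(partial_lists, n):
    """The single reducer picks the n largest among all mapper outputs."""
    return heapq.nlargest(n, (v for part in partial_lists for v in part))


# Hypothetical input splits; n is deliberately larger than two of them.
splits = [[17, 3, 42], [8, 25, 11], [30, 5, 19, 2]]
n = 4

partials = [mapper_top_n(s, n) for s in splits]
top = reducer_top_n(partials, n)
```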
> >> >>
> >> >> The other way is to just sort all of them in one reducer and then cat
> >> >> the top N.
> >> >>
> >> >> I am wondering if there is a better approach to this.
> >> >>
> >> >> Regards
> >> >> Praveenesh
> >> >>
