My actual problem is to rank all values and then run logic 1 to top n%
values and logic 2 to rest values.
1st  Ranking ? (need major suggestions here)
2nd  Find top n% out of them.
Then rest is covered.
Regards
Praveenesh
On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang <lakechang@gmail.com> wrote:
> there's one thing i want to clarify that you can use multireducers to sort
> the data globally and then cat all the parts to get the top n records. The
> data in all parts are globally in order.
> Then you may find the problem is much easier.
>
> 在 201322 下午3:18，"praveenesh kumar" <praveenesh@gmail.com>写道：
>
>> Actually what I am trying to find to top n% of the whole data.
>> This n could be very large if my data is large.
>>
>> Assuming I have uniform rows of equal size and if the total data size
>> is 10 GB, using the above mentioned approach, if I have to take top
>> 10% of the whole data set, I need 10% of 10GB which could be rows
>> worth of 1 GB (roughly) in my mappers.
>> I think that would not be possible given my input splits are of
>> 64/128/512 MB (based on my block size) or am I making wrong
>> assumptions. I can increase the inputsplit size, but is there a better
>> way to find top n%.
>>
>>
>> My whole actual problem is to give ranks to some values and then find
>> out the top 10 ranks.
>>
>> I think this context can give more idea about the problem ?
>>
>> Regards
>> Praveenesh
>>
>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ekirpichov@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Can you tell more about:
>> > * How big is N
>> > * How big is the input dataset
>> > * How many mappers you have
>> > * Do input splits correlate with the sorting criterion for top N?
>> >
>> > Depending on the answers, very different strategies will be optimal.
>> >
>> >
>> >
>> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar
>> > <praveenesh@gmail.com>wrote:
>> >
>> >> I am looking for a better solution for this.
>> >>
>> >> 1 way to do this would be to find top N values from each mappers and
>> >> then find out the top N out of them in 1 reducer. I am afraid that
>> >> this won't work effectively if my N is larger than number of values in
>> >> my inputsplit (or mapper input).
>> >>
>> >> Otherway is to just sort all of them in 1 reducer and then do the cat
>> >> of
>> >> topN.
>> >>
>> >> Wondering if there is any better approach to do this ?
>> >>
>> >> Regards
>> >> Praveenesh
>> >>
>> >
>> >
>> >
>> > 
>> > Eugene Kirpichov
>> > http://www.linkedin.com/in/eugenekirpichov
>> > http://jkff.info/software/timeplotters  my performance visualization
>> > tools
