Maybe look at the pig source to see how it does it?
Russell Jurney http://datasyndrome.com
On Feb 1, 2013, at 11:37 PM, praveenesh kumar <praveenesh@gmail.com> wrote:
> Thanks for that Russell. Unfortunately I can't use Pig. Need to write
> my own MR job. I was wondering how its usually done in the best way
> possible.
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <russell.jurney@gmail.com> wrote:
>> Pig. Datafu. 7 lines of code.
>>
>> https://gist.github.com/4696443
>> https://github.com/linkedin/datafu
>>
>>
>> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <praveenesh@gmail.com>wrote:
>>
>>> Actually what I am trying to find to top n% of the whole data.
>>> This n could be very large if my data is large.
>>>
>>> Assuming I have uniform rows of equal size and if the total data size
>>> is 10 GB, using the above mentioned approach, if I have to take top
>>> 10% of the whole data set, I need 10% of 10GB which could be rows
>>> worth of 1 GB (roughly) in my mappers.
>>> I think that would not be possible given my input splits are of
>>> 64/128/512 MB (based on my block size) or am I making wrong
>>> assumptions. I can increase the inputsplit size, but is there a better
>>> way to find top n%.
>>>
>>>
>>> My whole actual problem is to give ranks to some values and then find
>>> out the top 10 ranks.
>>>
>>> I think this context can give more idea about the problem ?
>>>
>>> Regards
>>> Praveenesh
>>>
>>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ekirpichov@gmail.com>
>>> wrote:
>>>> Hi,
>>>>
>>>> Can you tell more about:
>>>> * How big is N
>>>> * How big is the input dataset
>>>> * How many mappers you have
>>>> * Do input splits correlate with the sorting criterion for top N?
>>>>
>>>> Depending on the answers, very different strategies will be optimal.
>>>>
>>>>
>>>>
>>>> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <praveenesh@gmail.com
>>>> wrote:
>>>>
>>>>> I am looking for a better solution for this.
>>>>>
>>>>> 1 way to do this would be to find top N values from each mappers and
>>>>> then find out the top N out of them in 1 reducer. I am afraid that
>>>>> this won't work effectively if my N is larger than number of values in
>>>>> my inputsplit (or mapper input).
>>>>>
>>>>> Otherway is to just sort all of them in 1 reducer and then do the cat
of
>>>>> topN.
>>>>>
>>>>> Wondering if there is any better approach to do this ?
>>>>>
>>>>> Regards
>>>>> Praveenesh
>>>>>
>>>>
>>>>
>>>>
>>>> 
>>>> Eugene Kirpichov
>>>> http://www.linkedin.com/in/eugenekirpichov
>>>> http://jkff.info/software/timeplotters  my performance visualization
>>> tools
>>>
>>
>>
>>
>> 
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
