hadoop-general mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: Restricting number of records from map output
Date Fri, 14 Jan 2011 16:46:40 GMT
Hi,

> I have a sort job consisting of only the Mapper (no Reducer) task. I want my
> results to contain only the top n records. Is there any way of restricting
> the number of records that are emitted by the Mappers?
>
> Basically I am looking to see if there is an equivalent of achieving
> the behavior similar to LIMIT in SQL queries.

I think I understand your goal. However, the question is aimed at what
I think is the wrong solution.

A mapper gets one record at a time as input and only knows about that
one record, so there is no good place to apply a limit there.

If you implement a simple reducer, you can very easily let it stop
reading the input iterator after N records and limit the output in
that way.
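
To make the "stop reading after N records" idea concrete, here is a minimal
sketch in plain Java. It is not Hadoop code: the class LimitSketch and the
method firstN are hypothetical names, and the Iterator only stands in for the
values iterator that a reducer receives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java stand-in for the body of a reduce() call; no Hadoop classes are used.
// LimitSketch and firstN are hypothetical names chosen for this illustration.
class LimitSketch {
    // Copies at most n values from the iterator and then stops reading.
    static <T> List<T> firstN(Iterator<T> values, int n) {
        List<T> out = new ArrayList<>();
        // Check the limit first so no extra value is pulled once n is reached.
        while (out.size() < n && values.hasNext()) {
            out.add(values.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> values = Arrays.asList("a", "b", "c", "d").iterator();
        System.out.println(firstN(values, 2)); // prints [a, b]
    }
}
```

In a real job the same loop would sit inside the reducer. Note that the limit
is only global if all records reach the same reducer; with several reducers
each one would apply its own limit.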

Doing it in the reducer also allows you to easily add a concept of
"Top N" by using the "Secondary Sort" trick to sort the input before
it arrives at the reducer.
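
As a rough illustration of the "Top N" variant, the sketch below simulates it
in plain Java, again without any Hadoop classes. The sort call stands in for
the ordering a secondary sort would give you (by group, then by score
descending), and the loop plays the role of the reducer that keeps only the
first N values per group. TopNSketch, Rec, compositeOrder and topNPerGroup are
all hypothetical names, and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of "secondary sort, then limit"; all names are hypothetical.
class TopNSketch {
    // A record with a grouping key and the score we want to rank by.
    static final class Rec {
        final String group;
        final int score;
        Rec(String group, int score) { this.group = group; this.score = score; }
    }

    // Composite ordering: by group ascending, then by score descending.
    static int compositeOrder(Rec a, Rec b) {
        int byGroup = a.group.compareTo(b.group);
        return byGroup != 0 ? byGroup : Integer.compare(b.score, a.score);
    }

    // Sorts a copy of the input, then keeps the first n scores seen per group.
    static Map<String, List<Integer>> topNPerGroup(List<Rec> records, int n) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(TopNSketch::compositeOrder); // stand-in for the framework's sort
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (Rec r : sorted) {
            List<Integer> kept = out.computeIfAbsent(r.group, g -> new ArrayList<>());
            if (kept.size() < n) {
                kept.add(r.score); // values arrive best-first, so the first n are the top n
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
            new Rec("a", 3), new Rec("b", 7), new Rec("a", 9),
            new Rec("a", 5), new Rec("b", 1));
        System.out.println(topNPerGroup(records, 2)); // prints {a=[9, 5], b=[7, 1]}
    }
}
```

Because the values already arrive in the desired order, the reducer side needs
no sorting of its own; it only counts.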

HTH

Niels Basjes
