hadoop-mapreduce-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: How to select random n records using mapreduce ?
Date Mon, 27 Jun 2011 20:35:15 GMT
Building on this, you could do something like the following to make it 
more random:

if (numRecordsWritten < NUM_RECORDS_DESIRED) {
	// random is a java.util.Random; pick a number between 1 and 100
	int n = random.nextInt(100) + 1;
	if (n == 100) {
		context.write(key, value);
		numRecordsWritten++;
	}
}

The above would output roughly 1 out of every 100 records at random, up 
to a specified maximum number desired, and discard all the rest.
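
For illustration, here is a minimal sketch of how that logic might sit inside a 
complete mapper using the org.apache.hadoop.mapreduce API. The class name, the 
LongWritable/Text key and value types, and the NUM_RECORDS_DESIRED value are 
assumptions made up for this example, not something from the original post:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: passes through roughly 1% of its input at random,
// capped at NUM_RECORDS_DESIRED records per mapper task.
public class RandomSampleMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private static final int NUM_RECORDS_DESIRED = 1000; // per-task cap (assumed value)
    private final Random random = new Random();
    private int numRecordsWritten = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (numRecordsWritten < NUM_RECORDS_DESIRED) {
            // Roughly a 1-in-100 chance of keeping this record.
            if (random.nextInt(100) == 0) {
                context.write(key, value);
                numRecordsWritten++;
            }
        }
        // All other records are simply discarded.
    }
}

Since numRecordsWritten is a field of the mapper instance, the cap applies per 
map task, so the job as a whole emits at most NUM_RECORDS_DESIRED times the 
number of map tasks, much like the N / mapper_number idea in the original question.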

HTH,

DR

On 06/27/2011 03:28 PM, Niels Basjes wrote:
> The only solution I can think of is creating a counter in Hadoop
> that is incremented each time a mapper lets a record through.
> As soon as the value reaches a preselected limit, the mappers simply
> discard the additional input they receive.
>
> Note that this will not be random at all... yet it's the best I can
> come up with right now.
>
> HTH
>
> On Mon, Jun 27, 2011 at 09:11, Jeff Zhang <zjffdu@gmail.com> wrote:
>>
>> Hi all,
>> I'd like to select N random records from a large amount of data using
>> Hadoop; I just wonder how I can achieve this. Currently my idea is to let
>> each mapper task select N / mapper_number records. Does anyone have such
>> experience?
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
