hadoop-common-user mailing list archives

From <Jeff.Schm...@shell.com>
Subject RE: How to select random n records using mapreduce ?
Date Mon, 27 Jun 2011 20:00:38 GMT
Wait - Habermaas like in Critical Theory????

-----Original Message-----
From: Habermaas, William [mailto:William.Habermaas@fatwire.com] 
Sent: Monday, June 27, 2011 2:55 PM
To: common-user@hadoop.apache.org
Subject: RE: How to select random n records using mapreduce ?

I did something similar.  Basically I had a random sampling algorithm
that I called from the mapper. If it returned true I would collect the
data, otherwise I would discard it. 
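
A minimal sketch of the approach described above, written in plain Python rather than against the Hadoop API. The names `keep_record` and `sample_mapper` and the `probability` parameter are hypothetical; the original message does not say which sampling algorithm was used, so a simple Bernoulli trial per record is assumed here.

```python
import random


def keep_record(rng, probability):
    """Hypothetical sampling predicate: True with the given probability."""
    return rng.random() < probability


def sample_mapper(records, probability, seed=None):
    """Stand-in for a map task: emit a record only if the predicate says so."""
    rng = random.Random(seed)
    for record in records:
        if keep_record(rng, probability):
            yield record  # "collect" the record
        # otherwise the record is silently discarded


if __name__ == "__main__":
    kept = list(sample_mapper(range(10000), probability=0.1, seed=42))
    print(len(kept), "records kept out of 10000")
```

Note that this yields roughly `probability * total` records, not exactly n, so the fraction has to be chosen from an estimate of the input size.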


-----Original Message-----
From: niels@basj.es [mailto:niels@basj.es] On Behalf Of Niels Basjes
Sent: Monday, June 27, 2011 3:29 PM
To: mapreduce-user@hadoop.apache.org
Cc: core-user@hadoop.apache.org
Subject: Re: How to select random n records using mapreduce ?

The only solution I can think of is by creating a counter in Hadoop
that is incremented each time a mapper lets a record through.
As soon as the value reaches a preselected value the mappers simply
discard the additional input they receive.

Note that this will not at all be random.... yet it's the best I can
come up with right now.
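
For comparison, here is a sketch of a different technique that does give a uniform sample of exactly n records. It is not the counter approach described above: each record is tagged with a random key, each map task keeps only its n smallest keys, and a single reduce step keeps the n smallest keys overall. It is plain Python rather than Hadoop code, and `map_side` and `reduce_side` are hypothetical names.

```python
import heapq
import random


def map_side(records, n, rng):
    """Tag every record with a random key; keep the n smallest-keyed pairs."""
    tagged = ((rng.random(), record) for record in records)
    # nsmallest holds at most n items in memory at a time.
    return heapq.nsmallest(n, tagged, key=lambda pair: pair[0])


def reduce_side(partials, n):
    """Merge per-mapper candidates and keep the n smallest keys overall."""
    merged = (pair for partial in partials for pair in partial)
    best = heapq.nsmallest(n, merged, key=lambda pair: pair[0])
    return [record for _, record in best]


if __name__ == "__main__":
    rng = random.Random(7)
    # Three input splits of unequal size, standing in for three map tasks.
    splits = [range(0, 1000), range(1000, 2500), range(2500, 2600)]
    partials = [map_side(split, 10, rng) for split in splits]
    print(reduce_side(partials, 10))
```

Because the n globally smallest keys always survive their own mapper's cut, the result does not depend on how the input is split, and each mapper sends at most n records to the reducer.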


On Mon, Jun 27, 2011 at 09:11, Jeff Zhang <zjffdu@gmail.com> wrote:
> Hi all,
> I'd like to select N random records from a large amount of data using
> Hadoop, and I wonder how I can achieve this. Currently my idea is that
> each mapper task selects N / mapper_number records. Does anyone have
> experience with this?
> --
> Best Regards
> Jeff Zhang

Best regards / Met vriendelijke groeten,

Niels Basjes
