Date: Mon, 27 Jun 2011 21:28:43 +0200
Subject: Re: How to select random n records using mapreduce ?
From: Niels Basjes
To: mapreduce-user@hadoop.apache.org
Cc: core-user@hadoop.apache.org

The only solution I can think of is to create a counter in Hadoop that is incremented each time a mapper lets a record through. As soon as the counter reaches a preselected value, the mappers simply discard any additional input they receive.

Note that this will not be random at all, yet it's the best I can come up with right now.

HTH

On Mon, Jun 27, 2011 at 09:11, Jeff Zhang wrote:
>
> Hi all,
> I'd like to select random N records from a large amount of data using
> hadoop, just wonder how can I achieve this ? Currently my idea is that let
> each mapper task select N / mapper_number records. Does anyone have such
> experience ?
>
> --
> Best Regards
>
> Jeff Zhang
>

--
Best regards / Met vriendelijke groeten,

Niels Basjes
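A minimal sketch of the counter idea from the reply, in plain Java with no Hadoop dependency. The class name CounterLimitedFilter and its methods are hypothetical, and the AtomicLong is only a stand-in for the shared counter: a real Hadoop job counter is aggregated by the framework, so whether running mappers can read an up-to-date global value is an assumption here, not something the thread confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class CounterLimitedFilter {
    // Stand-in for the shared counter described in the message (assumption:
    // a local AtomicLong, not a real Hadoop job counter).
    private final AtomicLong passed = new AtomicLong();
    private final long limit;

    public CounterLimitedFilter(long limit) {
        this.limit = limit;
    }

    // Returns true while fewer than `limit` records have passed, false afterwards.
    public boolean accept() {
        while (true) {
            long current = passed.get();
            if (current >= limit) {
                return false;
            }
            // Retry if another thread claimed this slot first.
            if (passed.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Simulates one "mapper" reading its split and emitting only accepted records.
    public List<String> map(List<String> split) {
        List<String> out = new ArrayList<>();
        for (String record : split) {
            if (accept()) {
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        CounterLimitedFilter filter = new CounterLimitedFilter(3);
        System.out.println(filter.map(Arrays.asList("a", "b")));      // first split
        System.out.println(filter.map(Arrays.asList("c", "d", "e"))); // second split
    }
}
```

With a limit of 3, the two simulated splits yield [a, b] and then [c]: the output is simply whichever records arrive first, which is why the reply notes that the result is not random.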