Date: Thu, 21 May 2009 11:15:07 -0700
Subject: Re: Randomize input file?
From: Alex Loddengaard
To: core-user@hadoop.apache.org

Hi John,

I don't know of a built-in way to do this. Depending on how well you want to randomize, you could just run a MapReduce job with at least one map (the more maps, the more random) and no reduces. When you run a job with no reduces, the shuffle phase is skipped entirely, and the map outputs are written directly to HDFS. I think each mapper will create one HDFS file, though, so you'll have to concatenate all the files into a single file.

The above isn't a very good way to randomize, but it's fairly easy to implement and should run pretty quickly.

Hope this helps.

Alex

On Thu, May 21, 2009 at 7:18 AM, John Clarke wrote:

> Hi,
>
> I have a need to randomize my input file before processing. I understand I
> can chain Hadoop jobs together so the first could take the input file
> randomize it and then the second could take the randomized file and do the
> processing.
>
> The input file has one entry per line and I want to mix up the lines before
> the main processing.
>
> Is there an inbuilt ability I have missed or will I have to try and write a
> Hadoop program to shuffle my input file?
>
> Cheers,
> John
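
The map-only approach suggested in the reply can be sketched roughly as follows. This is a plain-Java simulation, not Hadoop code, and it rests on an assumption the thread does not spell out: that each map task shuffles the lines of its own input split in memory before writing them. The class name SplitShuffleSketch, the method shuffleBySplits, and its parameters are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Plain-Java simulation of a map-only shuffle; no Hadoop classes are used.
class SplitShuffleSketch {

    // Hypothetical helper. It cuts the input into at most numSplits contiguous
    // chunks (standing in for input splits), shuffles each chunk independently
    // (standing in for one map task, an assumption), and appends the chunks in
    // order (standing in for concatenating the per-mapper output files).
    static List<String> shuffleBySplits(List<String> lines, int numSplits, long seed) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be at least 1");
        }
        Random random = new Random(seed);
        List<String> result = new ArrayList<>(lines.size());
        // Ceiling division so that every line falls into some chunk.
        int chunkSize = Math.max(1, (lines.size() + numSplits - 1) / numSplits);
        for (int start = 0; start < lines.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, lines.size());
            // Copy the chunk so the caller's list is left untouched.
            List<String> chunk = new ArrayList<>(lines.subList(start, end));
            Collections.shuffle(chunk, random);
            result.addAll(chunk);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "line1", "line2", "line3", "line4", "line5", "line6");
        // Two simulated map tasks, each shuffling three lines.
        System.out.println(shuffleBySplits(lines, 2, 42L));
    }
}
```

Under this assumption a line never moves outside the chunk its split covers, which is one way to read the remark that this isn't a very good way to randomize.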