Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-user@hadoop.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <4238036a0905210718wc2159cax2f369a905bfe330d@mail.gmail.com>
References: <4238036a0905210718wc2159cax2f369a905bfe330d@mail.gmail.com>
Date: Thu, 21 May 2009 11:15:07 -0700
Message-ID: <623d9cf40905211115l3ab073fcwc3efa588db62bf22@mail.gmail.com>
Subject: Re: Randomize input file?
From: Alex Loddengaard <alex@cloudera.com>
To: core-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=001636988b5cd7756a046a701f83

--001636988b5cd7756a046a701f83
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hi John,

I don't know of a built-in way to do this.  Depending on how well you want
to randomize, you could just run a MapReduce job with at least one map (the
more maps, the more random) and no reduces.  When you run a job with no
reduces, the shuffle phase is skipped entirely, and the map outputs are
written directly to HDFS.  I think each mapper will create its own HDFS
file, though, so you'll have to concatenate those files into a single file
afterwards.

The above isn't a very good way to randomize, but it's fairly easy to
implement and should run pretty quickly.
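
To make the idea concrete, here is a rough local sketch in plain Python.
It is not Hadoop API code, only a simulation of the steps: the name
map_only_randomize and its parameters are made up for illustration, and it
assumes each mapper buffers its own lines and shuffles them before writing,
and that the per-mapper files are concatenated in random order.

```python
import random


def map_only_randomize(lines, num_maps, seed=None):
    """Simulate a map-only job locally: split, shuffle per mapper, concatenate."""
    rng = random.Random(seed)

    # Split the input into contiguous chunks, one per simulated mapper
    # (a stand-in for Hadoop's input splits). num_maps must be >= 1.
    chunk = max(1, -(-len(lines) // num_maps))  # ceiling division
    splits = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]

    parts = []
    for split in splits:
        part = list(split)
        # Assumption: each mapper holds its lines in memory and shuffles them.
        rng.shuffle(part)
        parts.append(part)

    # Concatenate the per-mapper output files in a random order.
    rng.shuffle(parts)
    return [line for part in parts for line in part]


lines = ["entry %d" % i for i in range(10)]
mixed = map_only_randomize(lines, num_maps=3, seed=42)
assert sorted(mixed) == sorted(lines)  # same lines, different order
```

In this sketch a line never leaves the chunk its mapper was given, which is
one reason the result is only roughly random.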

Hope this helps.

Alex

On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarkemjj@gmail.com> wrote:

> Hi,
>
> I need to randomize my input file before processing. I understand I can
> chain Hadoop jobs together, so the first could take the input file and
> randomize it, and then the second could take the randomized file and do
> the processing.
>
> The input file has one entry per line and I want to mix up the lines before
> the main processing.
>
> Is there an inbuilt ability I have missed or will I have to try and write a
> Hadoop program to shuffle my input file?
>
> Cheers,
> John
>

--001636988b5cd7756a046a701f83--