hadoop-common-user mailing list archives

From Alex Loddengaard <a...@cloudera.com>
Subject Re: Randomize input file?
Date Thu, 21 May 2009 18:15:07 GMT
Hi John,

I don't know of a built-in way to do this.  Depending on how well you want
to randomize, you could just run a MapReduce job with at least one map (the
more maps, the more random) and no reduces.  When you run a job with no
reduces, the shuffle phase is skipped entirely, and the map outputs are
written directly to HDFS.  Note that each mapper will create one HDFS file,
so you'll have to concatenate them into a single file yourself.

The above isn't a very good way to randomize, but it's fairly easy to
implement and should run pretty quickly.

Hope this helps.

Alex

On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarkemjj@gmail.com> wrote:

> Hi,
>
> I have a need to randomize my input file before processing. I understand I
> can chain Hadoop jobs together, so the first could take the input file and
> randomize it, and then the second could take the randomized file and do the
> processing.
>
> The input file has one entry per line and I want to mix up the lines before
> the main processing.
>
> Is there an inbuilt ability I have missed, or will I have to write a
> Hadoop program to shuffle my input file?
>
> Cheers,
> John
>
