hadoop-common-user mailing list archives

From "W.P. McNeill" <bill...@gmail.com>
Subject Preferred ways to specify input and output directories to Hadoop jobs
Date Wed, 08 Feb 2012 18:00:55 GMT
How do you like to specify input and output directories to your Hadoop jobs?

I have been using positional arguments. All but the last argument are input
directories, and the last one is the output directory. The output argument
overrides any mapred.output.dir configuration parameter, and the input
arguments are added to any mapred.input.dir. I like positional arguments
because it's a very natural UNIXy way of doing things. However, the more I
use this convention, the more complex it seems to me. For instance, you have
to decide what to do when there's only one positional argument. Or maybe
there are scenarios in which you want the positional input directories to
replace the configured ones rather than add to them. More generally, you have
to figure out how to reconcile positional arguments with configuration
parameters. Now I'm leaning towards only using the mapred.input.dir and
mapred.output.dir parameters.
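
To make the convention concrete, here is a minimal sketch of the merge rule
in plain Java. It is only an illustration: the class PositionalPaths and its
resolve method are hypothetical names, a Map stands in for the job
configuration, inputs are joined with commas on the assumption that
mapred.input.dir holds a comma-separated list, and treating a lone
positional argument as the output directory is just one of the possible
answers to the single-argument question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: models the job configuration
// as a plain Map so the merge rule can be read and run on its own.
class PositionalPaths {
    static final String INPUT_KEY = "mapred.input.dir";
    static final String OUTPUT_KEY = "mapred.output.dir";

    // Returns a copy of conf with the positional arguments applied:
    // the last argument overrides the output directory, and all earlier
    // arguments are appended to any configured input directories.
    static Map<String, String> resolve(Map<String, String> conf, String[] args) {
        Map<String, String> merged = new HashMap<>(conf);
        if (args.length == 0) {
            return merged; // nothing positional: configuration wins as-is
        }
        // Assumption: a single argument is treated as the output directory.
        merged.put(OUTPUT_KEY, args[args.length - 1]);
        if (args.length > 1) {
            String positional =
                String.join(",", Arrays.copyOfRange(args, 0, args.length - 1));
            String configured = merged.get(INPUT_KEY);
            merged.put(INPUT_KEY,
                (configured == null || configured.isEmpty())
                    ? positional
                    : configured + "," + positional);
        }
        return merged;
    }
}
```

Even in this small form the sketch has to hard-code answers to the open
questions: inputs always augment and never replace, and one argument always
means output.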

What do other people do?
