hadoop-common-user mailing list archives

From bejoy.had...@gmail.com
Subject Re: Preferred ways to specify input and output directories to Hadoop jobs
Date Wed, 08 Feb 2012 18:13:59 GMT
      When you pass the arguments on the CLI, it is your driver class that assigns
them to mapred.input.dir and mapred.output.dir. As far as I know, no default exists in the
MapReduce framework that would assign positional arguments to the input and output dirs. If you
don't want this assignment in your driver class, you can set the same properties from
the CLI as -D mapred.input.dir=myInputDir and -D mapred.output.dir=myOutputDir. Both approaches
do the same thing; there is no difference.
Choose whichever is more comfortable for you.
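To illustrate the -D form: this only works if the driver is run through ToolRunner so that GenericOptionsParser consumes the generic options. The jar and class names below (myjob.jar, MyDriver) are hypothetical placeholders, and note that GenericOptionsParser expects no spaces around the '=':

```shell
# Hypothetical invocation, assuming MyDriver implements Tool and is
# launched via ToolRunner so -D properties reach the job configuration.
hadoop jar myjob.jar MyDriver \
  -D mapred.input.dir=myInputDir \
  -D mapred.output.dir=myOutputDir
```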
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: "W.P. McNeill" <billmcn@gmail.com>
Date: Wed, 8 Feb 2012 10:00:55 
To: Hadoop Mailing List<common-user@hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: Preferred ways to specify input and output directories to Hadoop jobs

How do you like to specify input and output directories to your Hadoop jobs?

I have been using positional arguments. All but the last argument are input
directories and the last one is an output directory. These override
any mapred.output.dir configuration parameter and augment
any mapred.input.dir. I like positional arguments because it's a very
natural UNIXy way of doing things. However, the more I use this convention,
the more complex it seems to me. For instance, you have to decide what to
do when there's only one positional argument. Or maybe there are scenarios
in which you want the positional input directories to override the
configurational ones. More generally, you have to figure out how to
reconcile positional and configurational arguments. Now I'm leaning towards
only using the mapred.input.dir and mapred.output.dir parameters.
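For what it's worth, the positional convention described above (all arguments but the last are inputs, the last is the output) can be sketched Hadoop-free in a few lines. The JobPaths class name here is a hypothetical illustration, not anything from the Hadoop API:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: split positional args into inputs and one output. */
public class JobPaths {
    final List<String> inputs; // all but the last argument
    final String output;       // the last argument

    JobPaths(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                "expected at least one input directory and one output directory");
        }
        inputs = Arrays.asList(Arrays.copyOfRange(args, 0, args.length - 1));
        output = args[args.length - 1];
    }
}
```

In a real driver one would then feed inputs to FileInputFormat.addInputPath and output to FileOutputFormat.setOutputPath, which is where the reconciliation question with mapred.input.dir / mapred.output.dir arises.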

What do other people do?
