hadoop-common-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: Total input paths number and output
Date Sat, 02 Oct 2010 18:50:44 GMT
On Sat, Oct 2, 2010 at 11:35 PM, Shi Yu <shiyu@uchicago.edu> wrote:
> On 2010-10-2 12:01, Harsh J wrote:
>>
>> mapred.min.split.size and the minimum-map-tasks properties of Hadoop MR also
>> control the splitting of input for map tasks.
>>
>> On Oct 2, 2010 10:28 PM, "Harsh J"<qwertymaniac@gmail.com>  wrote:
>>
>> Outputs are not dependent on number of inputs, but instead the number of
>> reducers (if MapReduce) or number of input splits if just plain Maps.
>>
>> The number of splits is determined in most cases by the input file sizes
>> and
>> the set HDFS block size factor (dfs.block.size) it was created under.
>>
>>
>>
>>>
>>> On Oct 2, 2010 10:01 PM, "Shi Yu"<shiyu@uchicago.edu>  wrote:
>>>
>>> Hi,
>>>
>>> I am running some cod...
>>>
>>
>>
>
> Hi Harsh,
>
> Thanks for the answer. I understand what you have said. However, I was
> trying to see the effect in an experiment. For example, I use the exact same
> input (a 13M file) and try the simple WordCount example. I would like to see
> whether my configuration could change the number that appears in the log. The
> configuration in my main function is as follows:
>
>          JobConf conf = new JobConf(WordCount.class);
>          conf.setJobName("wordcount");
>          conf.setOutputKeyClass(Text.class);
>          conf.setOutputValueClass(IntWritable.class);
>          conf.setMapperClass(Map.class);
>          conf.setCombinerClass(Reduce.class);
>          conf.setReducerClass(Reduce.class);
>          conf.setMapOutputKeyClass(Text.class);
>          conf.setMapOutputValueClass(IntWritable.class);
>          conf.setInputFormat(ZipInputFormat.class);
>          conf.setInt("mapred.min.split.size",2);
The property "mapred.min.split.size" takes its value in bytes. Also note
that some input formats implement their own splitting techniques, so it
is not an enforced setting.
>          conf.setNumMapTasks(3);
For information's sake: by default, mapred.map.tasks is set to 2 in
Hadoop MR. It is treated as a hint, since the input size / files
determine the number of required maps, but with less data it still runs
the minimum configured number of maps (in order to use your cluster or
machine efficiently, I suppose).
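To see how these two settings interact for the 13M file in this thread, here is a minimal sketch of the split-sizing arithmetic the old-API FileInputFormat uses. The class and method names are my own, and the formulas are an approximation of the 0.20-era logic (goal size from the map-task hint, clamped by the minimum split size and the block size), not a verbatim copy of Hadoop's code:

```java
public class SplitMath {
    // Sketch of old-API FileInputFormat split sizing (an assumption based
    // on 0.20-era behavior):
    //   goalSize  = totalSize / requested map tasks
    //   splitSize = max(minSplitSize, min(goalSize, blockSize))
    public static long splitSize(long totalSize, int numMapTasks,
                                 long minSplitSize, long blockSize) {
        long goalSize = totalSize / Math.max(numMapTasks, 1);
        return Math.max(minSplitSize, Math.min(goalSize, blockSize));
    }

    // The splitter tolerates roughly a 10% overshoot (SPLIT_SLOP), so a
    // tiny trailing remainder is folded into the last split rather than
    // getting its own map task.
    public static int numSplits(long totalSize, long splitSize) {
        final double SLOP = 1.1;
        int n = 0;
        long remaining = totalSize;
        while ((double) remaining / splitSize > SLOP) {
            remaining -= splitSize;
            n++;
        }
        if (remaining > 0) n++;
        return n;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 13 * mb;   // the 13M input file from this thread
        long block = 64 * mb;   // default dfs.block.size
        // setNumMapTasks(3) with mapred.min.split.size=2:
        long s = splitSize(total, 3, 2L, block);
        System.out.println(s + " bytes/split -> "
                + numSplits(total, s) + " splits");
    }
}
```

With these numbers the goal size (about 4.3 MB) wins over both the tiny minimum and the 64 MB block size, which is why the map-task hint of 3 is honored even though the file fits in one block.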
>
> In the last two lines (mapred.min.split.size  and setNumMapTasks) I set
> different values, from 2 to 10.  But the log is always
>
> INFO mapred.FileInputFormat: Total input paths to process : 1
>
>
> Then I change to my real code using the exact same input, I set
>      conf.setNumMapTasks(1);
>      conf.setNumReduceTasks(1);
>
> The log shows
> INFO mapred.FileInputFormat: Total input paths to process : 2
I find this odd; FileInputFormat reports only the number of paths it
has to process under the directory it was given. If you specify a file
directly, it should not report 2.

Unless it's the doing of the ZipInputFormat, wherein (I assume) it
reports the number of files inside the zip archive?
>
> What's wrong? Why can't I see the direct effect of my settings? The input
> file is 13M, so it is smaller than the default block size of 64M. I leave
> the block size setting at its default.
>
> Thanks.
>
> Best Regards,
>
> Shi
>
>

I was replying from a mobile device earlier so couldn't be very clear,
apologies.

What you're asking for is a way to control the number of outputs,
correct? For jobs that have a Reduce phase, neither the number of input
paths detected nor the number of maps launched determines the final
output count.

If you want a single-file output, you'd set job.setNumReduceTasks(1);
and so on for as many outputs as you need. Usually the property
mapred.reduce.tasks (which the above method sets anyway) is set to a
prime number near the number of tasktracker nodes. Although that is not
a necessity, it helps parallelize the operation in a neat manner.

About controlling the input split behavior: it depends on the
InputFormat derivative you are using. FileInputFormats generate a
minimum of n splits for n files, but may run n+m mappers based on how
the files factor against the block size (or mapred.min.split.size, if
set to a valid number other than 0, as it works with FIF). But yes, the
"Total input paths to process" message it logs is basically the size of
the array of files it found valid under the path or list of paths you
supplied (FIF ignores . and _ prefixes, if I am right, and doesn't
count a dir).
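That "." / "_" rule can be expressed as a one-line name predicate. This is a sketch of the idea only; the class and method names below are hypothetical, not Hadoop's actual filter class:

```java
public class HiddenPathFilter {
    // Paths whose base name starts with "_" (e.g. _logs) or "." (e.g.
    // .data.crc) are skipped when FIF counts its input paths -- this
    // predicate mirrors that rule.
    public static boolean isVisible(String baseName) {
        return !baseName.startsWith("_") && !baseName.startsWith(".");
    }
}
```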

Are you sure that the directory which you are passing to FIF has only
one file under it? Or perhaps the ZipInputFormat has its own
path-listing techniques?
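A quick way to check what the framework will count is to list the directory yourself and apply the same visibility rule. The helper below is a hypothetical illustration using plain java.io against a local directory; against HDFS you would list with the Hadoop FileSystem API instead:

```java
import java.io.File;

public class InputPathCounter {
    // Count the non-hidden regular files directly under a local directory,
    // approximating what "Total input paths to process" would report for a
    // single-directory input (subdirectories are not counted).
    public static int countVisibleFiles(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) return 0; // not a directory, or an I/O error
        int count = 0;
        for (File f : entries) {
            String name = f.getName();
            if (f.isFile() && !name.startsWith("_") && !name.startsWith(".")) {
                count++;
            }
        }
        return count;
    }
}
```

If this reports more than one file for your input directory, that would explain the "Total input paths to process : 2" line.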

-- 
Harsh J
www.harshj.com
