hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Florin P <florinp...@yahoo.com>
Subject Computing the number of mappers when CombineFileInputFormat is used
Date Fri, 12 Aug 2011 06:35:42 GMT
   I would like to know how Hadoop is computing the number of mappers when CombineFileInputFormat
is used? I have read the API specification for CombineFileInputFormat (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html),
but unfortunately I could not understand the way that the input splits are computed.
We have a cluster with 10 Data nodes, and data files (mapfile) spread over them.
We have used this InputFormat in our M/R jobs. Reading the spec, we have to distinguish betwenn
three scenarios:
  1. if the  mapred.max.split.size property it is not specfied, then we will have one mapper.
This behavior is correct regarding the spec:
"If maxSplitSize is not specified, then blocks from the same rack are combined in a single
split; no attempt is made to create node-local splits"
2. mapred.max.split.size value specified and equals with the the block size (our case 64M).
According to spec: "If the maxSplitSize is equal to the block size, then this class is similar
to the default spliting behaviour in Hadoop: each block is a locally processed split."
Question: So for my understanding, in this case, the number of splits it is calculated the
same as in the case when you don't use CombineFileInputFormat?
3. According to spec: "If a maxSplitSize is specified, then blocks on the same node are combined
to form a single split. Blocks that are left over are then combined with other blocks in the
same rack"
   Question 1: From the above, I have understood that the number of splits is equal with the
number of nodes ("blocks on the same node are combined to form a single split"). I have observed
that is not the case. So how you compute?
  Question 2. What is the best practice to set up the mapred.max.split.size(maxSplitSize)
greater or equal with the block size?
  (In my opinion, I'll use the same size as block size in order do not loose data locality,
but please correct me if I'm wrong)

In the spec, it is stated that "A split cannot have files from different pools". What means
pool? A datanode?

I'll look forward for your answers.
Thank you,

View raw message