hadoop-general mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: How many records will be passed to a map function??
Date Fri, 18 Jun 2010 23:27:43 GMT

In general, you should let Hadoop pick the number of mappers to use.
With only 1000 records totaling 12k, an IO-bound job will perform
better with a single mapper. When you force the number of map tasks,
Hadoop will do the following:

(Assuming FileInputFormat#getSplits(conf, numSplits) gets called)

totalSize is the combined size of all input files, in bytes
goalSize is totalSize / numSplits
minSplitSize is the conf value mapred.min.split.size (default 1)

For each input file:
  length = file.size()
  while isSplitable(file) and length != 0
    fileBlockSize is the block size of the file
    minOfGoalBlock is min(goalSize, fileBlockSize)
    realSplitSize is max(minSplitSize, minOfGoalBlock)

    emit a split of realSplitSize bytes (the last one may be smaller)
    length is length minus realSplitSize (give or take)
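
For anyone who prefers something runnable, here is a minimal Python
sketch of the pseudocode above. It is not the Hadoop source: the names
get_splits and compute_split_size are only illustrative, it assumes one
block size for every file, skips empty files, and ignores the "give or
take" slack in the real loop.

```python
def compute_split_size(goal_size, min_split_size, block_size):
    # realSplitSize = max(minSplitSize, min(goalSize, fileBlockSize))
    return max(min_split_size, min(goal_size, block_size))


def get_splits(file_sizes, num_splits, block_size,
               min_split_size=1, splittable=True):
    """Return (file_index, offset, length) tuples, one per split."""
    total_size = sum(file_sizes)
    goal_size = total_size // max(num_splits, 1)
    splits = []
    for index, length in enumerate(file_sizes):
        if length == 0:
            continue  # simplification: empty files produce no split here
        if not splittable:
            # A file that cannot be split becomes one split of its full size.
            splits.append((index, 0, length))
            continue
        # Guard against a zero split size so the loop always makes progress.
        split_size = max(1, compute_split_size(goal_size, min_split_size,
                                               block_size))
        offset = 0
        while offset < length:
            chunk = min(split_size, length - offset)
            splits.append((index, offset, chunk))
            offset += chunk
    return splits


# One 12k file, 64MB blocks, two requested map tasks.
example = get_splits([12 * 1024], num_splits=2, block_size=64 * 1024 * 1024)
print(example)  # [(0, 0, 6144), (0, 6144, 6144)]
```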

Note that it's actually more confusing than this, but this is the
general idea. Let's plug in some numbers:

1 file
totalSize = 12k file size
blockSize = 64MB block
numSplits = 2
goalSize = 6k (12k / 2)
minSplitSize = 1 (for FileInputFormat)

minOfGoalBlock = 6k (6k < 64MB)
realSplitSize = 6k (6k > 1)

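We end up with 2 splits, 6k each. RecordReaders then parse each split
into records. Because splits are cut by byte count rather than record
count, the two mappers see the same number of records only if the
records are roughly the same length.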
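
To see how records fall out of byte-based splits, here is a rough
Python model of line-oriented reading. It assumes the usual convention
for line readers: a split that does not start at byte 0 discards
everything up to its first newline, and every split reads past its end
to finish the last line it started. The function lines_for_split and
the sample data (500 short records followed by 500 long ones, 12000
bytes in total) are made up for illustration; this is not the actual
LineRecordReader code.

```python
def lines_for_split(data, start, length):
    """Return the lines owned by the split covering [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: skip to just past the next newline. The line
        # that contains byte `start` is read by the previous split.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    lines = []
    # Keep every line that begins at or before `end`, reading past `end`
    # if needed to finish it.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


# Hypothetical input: 500 records of 4 bytes, then 500 records of 18 bytes,
# each followed by a newline (500 * 5 + 500 * 19 = 12000 bytes).
records = [b"a" * 4] * 500 + [b"b" * 18] * 500
data = b"".join(r + b"\n" for r in records)

half = len(data) // 2
splits = [(0, half), (half, len(data) - half)]
counts = [len(lines_for_split(data, s, n)) for s, n in splits]
print(counts)  # [685, 315]
```

Both splits are 6000 bytes, yet the first mapper sees 685 records and
the second 315, because the short records all sit at the front of the
file.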

Note that this applies to the old APIs. The newer APIs work slightly
differently, but I think the result is equivalent.

(If anyone wants to double check my summary, I welcome it. This is
some hairy code and these questions come up frequently.)

Hope this helps.

On Wed, Jun 16, 2010 at 8:10 AM, Karan Jindal
<karan_jindal@students.iiit.ac.in> wrote:
> Hi all,
> Given a scenario in which an input file contains a total of 1000 records (one
> record per line) with a total size of 12k, and I set the number of map tasks to 2.
> How many records will be passed to each map task? Is it the equal
> distribution?
> InputFormat = Text
> Block size  = default block of hdfs
> Hoping for a reply..
> Regards
> Karan

Eric Sammer
twitter: esammer
data: www.cloudera.com
