hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: block-size vs split-size
Date Tue, 27 Nov 2012 17:56:37 GMT
Hi,

Response inline.

On Tue, Nov 27, 2012 at 8:35 PM, Kartashov, Andy <Andy.Kartashov@mpac.ca> wrote:
> Guys,
>
> I understand that if not specified, the default block size of HDFS is 64 MB.
> You can control this value by altering the dfs.block.size property and
> increasing the value to 64 MB x 2 or 64 MB x 4. Every time we make a change
> to this property we must re-import the data for the change to take effect.

Yes, to change a file's block size at the DFS level, the file has to be
re-written with the new value.
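
If it helps, here's a minimal sketch of doing that re-write
programmatically (the paths and the 128 MB size are hypothetical; the
create() overload that takes a per-file block size is regular
FileSystem API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.IOUtils;

    public class RewriteWithBlockSize {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("/data/input.txt");       // existing file
        Path dst = new Path("/data/input-128m.txt");  // copy w/ new block size

        FSDataInputStream in = fs.open(src);
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(dst, true, 4096,
            fs.getDefaultReplication(), 128L * 1024 * 1024);
        IOUtils.copyBytes(in, out, conf); // copies, then closes both streams
      }
    }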

> My question is about split size. I understand it is used by MapReduce to
> assign task to tasktrackers:
>
> 1.       Is the split size (if not specified via the mapred.min.split.size
> property) by default equal to the default block size of 64 MB?

If unspecified, the MR framework relies on the file's retrieved block
sizes to create the tasks. So yes, it defaults to the file's block
size if you use HDFS.
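
For reference, the computed split size boils down to the logic below
(this mirrors FileInputFormat's computeSplitSize() in the Hadoop
source; minSize/maxSize come from the min/max split-size settings):

    // the block size wins unless the configured min/max split sizes
    // push it higher or cap it lower
    long computeSplitSize(long blockSize, long minSize, long maxSize) {
      return Math.max(minSize, Math.min(maxSize, blockSize));
    }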

> 2.       If you increase the block size, say to 128 MB, will the split size
> (if not specified) be equal to the 128 MB block size, or will it remain at
> the 64 MB default block size?

Since (1) holds as the untouched default, the split size will also
change to 128 MB.

> 3.       If, say, your input file is 128 MB, at the default 64 MB block size
> the file will total two blocks. Your JobTracker will create two map
> tasks (one per block)… but what if you specified mapred.min.split.size at
> 128 MB? I suppose, despite the fact there are 2 blocks in HDFS, there will
> be only one input split for MapReduce. Is mapred.min.split.size designed
> to override the block-size property when preparing InputSplits?

Correct.

128 MB file, with 64 MB block size --> Defaults --> 2 tasks
128 MB file, with 64 MB block size --> Min split size 128 MB --> 1 task

Yes, the property is designed to override the simple default block
size -> input split size mapping.
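
In driver code that would look something like the hypothetical
old-API fragment below (same property name you used; the class name
is a placeholder):

    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizeDriver {
      public static JobConf configure() {
        JobConf conf = new JobConf();
        // 128 MB min split: the two 64 MB blocks collapse into one split
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
        return conf;
      }
    }

With the new API you'd call FileInputFormat.setMinInputSplitSize(job,
size) from org.apache.hadoop.mapreduce.lib.input instead.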

> 4.       Given the ability to set block size and split size individually,
> was the main purpose of this to gain better control of one property over
> the other? If my understanding of the 3rd point is correct, then, say, you
> imported your data at a 128 MB block size but later realised you should have
> gone higher… instead of re-importing all your data at 256 MB per block, you
> can change the split size property to 256 MB. Am I grasping this concept
> correctly?

While you've got the concept of splits right, there's one point about
locality to consider:

You could do that, but you would lose locality. The reason the default
split algorithm sticks to block boundaries is so that each task
processes exactly one block, and the scheduler can do a more
effective job of making the task run where that individual block
resides.

When you override min-split-size and make the split carry two blocks'
worth of offset + length, the two blocks could be residing on
different nodes, but the task will run on only one of them, leading to
non-data-local processing, which could end up being slower.

> 5.

You missed your 5th question?

--
Harsh J
