hadoop-user mailing list archives

From "Kartashov, Andy" <Andy.Kartas...@mpac.ca>
Subject RE: block-size vs split-size
Date Tue, 27 Nov 2012 19:50:04 GMT
Thanks Harsh. I totally forgot about the locality thing.

I take it that, for the best performance, it is better to leave the split size property alone and let
the framework handle the splits on the basis of the block size.

p.s. There were meant to be only 5 questions.


-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com]
Sent: Tuesday, November 27, 2012 12:57 PM
To: <user@hadoop.apache.org>
Subject: Re: block-size vs split-size


Response inline.

On Tue, Nov 27, 2012 at 8:35 PM, Kartashov, Andy <Andy.Kartashov@mpac.ca> wrote:
> Guys,
> I understand that, if not specified, the default block size of HDFS is
> 64 MB. You can control this value by altering the dfs.block.size property
> and increasing the value to 64 MB x 2 or 64 MB x 4. Every time we make
> a change to this property we must reimport the data for the change
> to take effect.

Yes, to change the DFS level file block size, the file has to be re-written with the new value.

> My question is about split size. I understand it is used by MapReduce
> to assign tasks to tasktrackers:
> 1.       Does the split size (if not specified in the property
> mapred.min.split.size) default to the default block size, i.e. 64 MB?

If unspecified, the MR framework does rely on the retrieved block sizes of the file to create
the tasks. So yes, it defaults to the file's block size if you use HDFS.
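
A minimal sketch of that default, modeled on the max(minSize, min(maxSize, blockSize)) rule used by Hadoop's FileInputFormat. The Python function and parameter names are illustrative only, not Hadoop API, and the default bounds are assumptions chosen so that the block size wins when nothing is configured:

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Pick a split size: the block size, clamped by the min/max settings.

    min_size stands in for mapred.min.split.size; max_size is an assumed
    upper bound that is effectively unlimited when left alone.
    """
    return max(min_size, min(max_size, block_size))

# With nothing configured, the split size follows the file's block size.
default_split = compute_split_size(64 * MB)
larger_block_split = compute_split_size(128 * MB)
```

With the bounds untouched, a file written with 64 MB blocks yields 64 MB splits, and the same file rewritten with 128 MB blocks yields 128 MB splits.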

> 2.       What if you increased the block size to, say, 128 MB? Will the split size (if not
> specified) be equal to the 128 MB block size, or will it remain at the 64 MB
> default block size?

Since (1) is true as an untouched default, the split size will also change to 128 MB.

> 3.       Say your input file is 128 MB. At the default 64 MB block size you
> will get a file totalling two blocks. Your JobTracker will create
> two map tasks (one per block)… but what if you specified
> mapred.min.split.size at 128 MB? I suppose, despite the fact there are
> 2 blocks in HDFS, there will be only one input split for
> MapReduce. Is mapred.min.split.size designed to override the block-size property when
> preparing InputSplits?


128 MB file, with 64 MB block size --> Defaults --> 2 tasks
128 MB file, with 64 MB block size --> Min split size 128 MB --> 1 task

Yes, the property is designed to override the simple default of block size -> input split
size mapping.
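
The two cases above can be reproduced with a rough sketch. It assumes the split size is simply the larger of the block size and the minimum split size, and it uses plain ceiling division, ignoring any tolerance the real implementation applies to the final split. The function name count_splits is hypothetical:

```python
MB = 1024 * 1024

def count_splits(file_size, block_size, min_split_size=1):
    """Estimate how many input splits (and so map tasks) a file produces."""
    # The split size falls back to the block size unless the minimum is larger.
    split_size = max(min_split_size, block_size)
    # Ceiling division: a trailing partial split still counts as one split.
    return (file_size + split_size - 1) // split_size

tasks_default = count_splits(128 * MB, 64 * MB)
tasks_min_128 = count_splits(128 * MB, 64 * MB, min_split_size=128 * MB)
```

Here tasks_default corresponds to the first line of the table (2 tasks) and tasks_min_128 to the second (1 task).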

> 4.       Given the ability to set block size and split size individually,
> was the main purpose of this to gain better control of one property
> over the other? Suppose my understanding of the 3rd point is correct. Then,
> say, you imported your data at a 128 MB block size but later realised
> you should have gone higher… instead of re-importing all your data at
> 256 MB per block, you can change the split size property to 256 MB. Am I
> grasping this concept correctly?

While you get the concept of splits right, there's one point about locality to consider:

You could do that, but you would lose locality. The default split algorithm sticks to block
boundaries so that each task processes exactly one block, which lets the scheduler do a more
effective job of running the task where that block resides.

When you override the min split size and make the split carry two blocks' worth of offset + length,
the two blocks could reside on different nodes, but the task will run on only one node. That
leads to non-data-local processing, which could end up being slower.
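
To make the locality point concrete, here is a small sketch with a made-up block layout. The node names, replica placement, and the helper fully_local_hosts are all hypothetical; the sketch only checks which hosts hold every block that a split overlaps:

```python
MB = 1024 * 1024

# Hypothetical layout of a 128 MB file: (offset, length, replica hosts).
blocks = [
    (0 * MB, 64 * MB, {"node1", "node2", "node3"}),
    (64 * MB, 64 * MB, {"node4", "node5", "node6"}),
]

def fully_local_hosts(split_offset, split_length, blocks):
    """Return the hosts that store every block overlapped by the split."""
    split_end = split_offset + split_length
    hosts = None
    for offset, length, replicas in blocks:
        # A block matters only if its byte range overlaps the split's range.
        if offset < split_end and offset + length > split_offset:
            hosts = set(replicas) if hosts is None else hosts & replicas
    return hosts or set()

one_block_split = fully_local_hosts(0, 64 * MB, blocks)
two_block_split = fully_local_hosts(0, 128 * MB, blocks)
```

In this layout the one-block split can run fully data-local on any of three nodes, while the two-block split has no node holding both blocks, so at least one block must be read over the network.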

> 5.

You missed your 5th question?

Harsh J