hadoop-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: block-size vs split-size
Date Tue, 27 Nov 2012 19:54:15 GMT
Harsh,

   You ruined my hard work ;)
I had written a big reply for Andy and suddenly I got a call from a client.
Meanwhile you posted your reply. No worries, it was almost the same as yours.
Anyways, thank you for the detailed explanation. :)

Regards,
    Mohammad Tariq



On Wed, Nov 28, 2012 at 1:20 AM, Kartashov, Andy <Andy.Kartashov@mpac.ca> wrote:

> Thanks Harsh. I totally forgot about the locality thing.
>
> I take it, for the best performance it is better to leave the split size
> property alone and let the framework handle the splits on the basis of the
> block size.
>
> p.s. There were meant to be only 5 questions.
>
> Rgds,
> AK47
>
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Tuesday, November 27, 2012 12:57 PM
> To: <user@hadoop.apache.org>
> Subject: Re: block-size vs split-size
>
> Hi,
>
> Response inline.
>
> On Tue, Nov 27, 2012 at 8:35 PM, Kartashov, Andy <Andy.Kartashov@mpac.ca>
> wrote:
> > Guys,
> >
> > I understand that, if not specified, the default block size of HDFS is
> > 64 MB. You can control this value by altering the dfs.block.size property
> > and increasing the value to 64 MB x 2 or 64 MB x 4. Every time we make
> > a change to this property we must reimport the data for the change
> > to take effect.
>
> Yes, to change the DFS level file block size, the file has to be
> re-written with the new value.
>
> > My question is about split size. I understand it is used by MapReduce
> > to assign tasks to tasktrackers:
> >
> > 1.       Is split size (if not specified in the property
> > mapred.min.split.size) by default equal to the default block size, or
> > 64 MB?
>
> If unspecified, the MR framework does rely on the retrieved block sizes of
> the file to create the tasks. So yes, it defaults to the file's block size
> if you use HDFS.
>
> > 2.       If you increase the block size, say to 128 MB, will the split
> > size (if not specified) be equal to the 128 MB block size, or will it
> > remain at the 64 MB default block size?
>
> Since (1) is true as an untouched default, the split size will also change
> to 128 MB.
>
> > 3.       Say your input file is 128 MB. At the default 64 MB block size
> > you will get a file totalling two blocks. Your JobTracker will create
> > two map tasks (one per block), but what if you specified
> > mapred.min.split.size at 128 MB? I suppose, despite the fact there are
> > 2 blocks in HDFS, there will be only one input split for
> > MapReduce. Is mapred.min.split.size designed to override the block-size
> > property when preparing InputSplits?
>
> Correct.
>
> 128 MB file, with 64 MB block size --> Defaults --> 2 tasks
> 128 MB file, with 64 MB block size --> Min split size 128 MB --> 1 task
>
> Yes, the property is designed to override the simple default of block size
> -> input split size mapping.
>
> > 4.       Given the ability to set block size and split size
> > individually, was the main purpose of this to gain better control of one
> > property over another? If my understanding of the 3rd point is correct,
> > then, say, you imported your data at 128 MB block size but later realised
> > you should have gone higher. Instead of re-importing all your data at
> > 256 MB per block, you can change the split size property to 256 MB. Am I
> > grasping this concept correctly?
>
> While you get the concept of splits right, there's one point about
> locality to consider:
>
> You could do that, but you would lose locality. The reason the default
> split algorithm sticks to block boundaries is so that each task
> individually processes one block alone, and the scheduler can do a more
> effective job in making the task run where this individual block resides.
>
> When you override min-split-size and make the split carry two blocks' worth
> of offset + length, then the two blocks could be residing at different
> nodes but the task will run only at one node, leading to non-data-local
> processing, which could end up being slower.
>
> > 5.
>
> You missed your 5th question?
>
> --
> Harsh J
>
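
For anyone who wants to check the arithmetic in the thread above, here is a
rough sketch (not Hadoop code) of how block size and minimum split size
interact. It assumes the split size is chosen as
max(min_size, min(max_size, block_size)), which is my understanding of what
FileInputFormat does in the new API, and it ignores details such as the small
slop allowed on the last split. The names compute_split_size and plan_splits
are made up for this illustration. Each planned split also reports how many
HDFS blocks it spans, which is where the locality concern comes from.

```python
MB = 1024 * 1024
NO_LIMIT = 2**63 - 1  # stand-in for "no maximum split size configured"


def compute_split_size(block_size, min_size=1, max_size=NO_LIMIT):
    # Assumed rule: the block size, clamped between the min and max split size.
    return max(min_size, min(max_size, block_size))


def plan_splits(file_size, block_size, min_size=1, max_size=NO_LIMIT):
    """Return a list of (offset, length, blocks_spanned) tuples for one file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        first_block = offset // block_size
        last_block = (offset + length - 1) // block_size
        splits.append((offset, length, last_block - first_block + 1))
        offset += length
    return splits


if __name__ == "__main__":
    # 128 MB file, 64 MB blocks, defaults: one split per block.
    for split in plan_splits(128 * MB, 64 * MB):
        print("default:", split)
    # Same file with a 128 MB minimum split size: one split over both blocks.
    for split in plan_splits(128 * MB, 64 * MB, min_size=128 * MB):
        print("min split 128 MB:", split)
```

A split whose blocks_spanned value is greater than 1 is the case Harsh
describes: the task runs on one node, so at most part of its input is likely
to be local unless the blocks happen to share a replica location.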
