hadoop-general mailing list archives

From Syed Wasti <mdwa...@hotmail.com>
Subject RE: Data Block Size ?
Date Thu, 15 Jul 2010 19:13:34 GMT

Thank you Allen.
So, is it fair to assume that if I have a smaller block size (64 MB), my blocks are distributed
across more datanodes, and because my blocks are spread over more datanodes, my maps should
also run on different datanodes, and because each map's input will be smaller, it should execute
faster using fewer resources?
Should it work this way? Or is there an algorithm that decides how the blocks are distributed
across the datanodes and where the replicas should go?

Let's say I have a file of 640 MB and a cluster with 5 datanodes, and I have configured the block
size to be 64 MB. How will this file be distributed?
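
For reference, the arithmetic behind this example can be sketched as below. This is only a rough sketch: the replication factor of 3 is an assumption (it is the HDFS default, but the thread does not state it), `block_layout` is a hypothetical helper, and the per-node figure is an average rather than a statement of where any particular replica lands.

```python
MB = 1024 * 1024

def block_layout(file_bytes, block_bytes, replication, datanodes):
    """Return (blocks, total replicas, average replicas per datanode)."""
    # Ceiling division: a final partial block still counts as a block.
    blocks = -(-file_bytes // block_bytes)
    replicas = blocks * replication
    return blocks, replicas, replicas / datanodes

# 640 MB file, 64 MB blocks, 5 datanodes; replication=3 is an assumption.
blocks, replicas, avg_per_node = block_layout(640 * MB, 64 * MB, 3, 5)
print(blocks, replicas, avg_per_node)  # prints: 10 30 6.0
```

The actual placement of each replica is decided by the namenode, so individual datanodes may hold more or fewer than the average.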

Syed Wasti

> From: awittenauer@linkedin.com
> To: general@hadoop.apache.org
> Subject: Re: Data Block Size ?
> Date: Thu, 15 Jul 2010 18:49:04 +0000
> On Jul 15, 2010, at 11:40 AM, Syed Wasti wrote:
> > Will it matter what the data block size is ? 
> Yes.
> > It is recommended to have a block size of 64 MB, but if we want to set the data
> > block size to 128 MB, should this affect the performance ?
> Yes.
> FWIW, we run with 128MB.
> > Does the size of the map jobs created on each datanode in any way depend on the block
> > size ?
> Yes.
> Unless told otherwise, Hadoop will generally use the # of maps == # of blocks.  So if
> you have fewer blocks to process, you'll have fewer maps to do more work.  This is not necessarily
> a bad thing; it all depends upon your workload, size of grid, etc.
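
The "# of maps == # of blocks" rule of thumb above can be illustrated with a small sketch. It assumes the 640 MB file from earlier in the thread and ignores anything that would change the split count (custom input formats, split-size settings, compressed inputs). `map_tasks` is a hypothetical helper, not a Hadoop API.

```python
def map_tasks(file_mb, block_mb):
    """Approximate map count when one map is launched per block."""
    # Ceiling division: a trailing partial block still gets its own map.
    return -(-file_mb // block_mb)

FILE_MB = 640  # example file size taken from the thread

for block_mb in (64, 128):
    maps = map_tasks(FILE_MB, block_mb)
    print(f"{block_mb} MB blocks -> {maps} maps, ~{FILE_MB / maps:.0f} MB per map")
```

Doubling the block size halves the number of maps and doubles the input each map handles; whether that is better depends on the workload and the size of the grid.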