hadoop-common-user mailing list archives

From Konstantin Shvachko <...@yahoo-inc.com>
Subject Re: HDFS blocks
Date Fri, 27 Jun 2008 19:10:25 GMT


lohit wrote:
>> 1. Can we have multiple files in DFS use different block sizes?
> No, currently this is not possible; we have fixed-size blocks.

Actually you can. HDFS provides an API to specify the block size
when you create a file. Here is the link:
http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long,%20org.apache.hadoop.util.Progressable)
This should probably be in the Hadoop FAQ.
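
For illustration, a minimal sketch of using that create() overload to write a
file with a 6 MB block size. The path, replication factor, and payload below
are placeholders, not anything from Ankur's setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long blockSize = 6L * 1024 * 1024;   // per-file block size: 6 MB
    short replication = 3;               // replication factor for this file
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);

    // create(Path, overwrite, bufferSize, replication, blockSize, progress)
    FSDataOutputStream out = fs.create(new Path("/data/chunks/chunk-0001"),
        true, bufferSize, replication, blockSize, null);
    out.write("streamed records go here\n".getBytes());
    out.close();
  }
}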

>> 2. If we use the default block size for these small chunks, is the DFS space
>> wasted?
>
> DFS space is not wasted; each block is stored on a datanode's local filesystem
> at its actual size. But you would be wasting the NameNode's namespace. The
> NameNode holds the entire namespace in memory, so instead of the entries for
> one file with a 128 MB block you would have entries for many 6 MB files,
> roughly 20 times as many files and blocks for the same amount of data.
> 
>> If not, then does it mean that a single DFS block can hold data from
>> more than one file?
> A DFS block cannot hold data from more than one file. If your file is, say,
> 5 MB and your default block size is, say, 128 MB, then the block stored in
> DFS is only 5 MB.
>
> To overcome this, people usually run a map/reduce job with one reducer and an
> identity mapper, which merges all the small files into one file. In Hadoop 0.18
> we have archives, and once HADOOP-1700 is done, one will be able to reopen a
> file and append to it.
> 
> Thanks,
> Lohit
> 
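
A rough sketch of the merge job lohit describes, using the mapred API of that
era with IdentityMapper/IdentityReducer and a single reduce task. The input and
output paths are placeholders, and with the default TextInputFormat and
TextOutputFormat the merged lines will carry their byte-offset keys, so treat
this as an outline rather than a drop-in tool:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MergeSmallFiles {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MergeSmallFiles.class);
    conf.setJobName("merge-small-files");

    // Identity map and reduce: records pass through unchanged.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    // One reducer, so all records end up in a single output file.
    conf.setNumReduceTasks(1);

    // Default TextInputFormat yields <byte offset, line> pairs.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path("/data/chunks"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/merged"));

    JobClient.runJob(conf);
  }
}
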
> 
> ----- Original Message ----
> From: "Goel, Ankur" <Ankur.Goel@corp.aol.com>
> To: core-user@hadoop.apache.org
> Sent: Friday, June 27, 2008 2:27:57 AM
> Subject: HDFS blocks
> 
> 
> Hi Folks,
>         I have a setup wherein I am streaming data into HDFS from a
> remote location, creating a new file every X minutes. The files generated
> are very small (512 KB - 6 MB). Since that is the size range, the streaming
> code sets the block size to 6 MB, whereas the default we have set for the
> cluster is 128 MB. The idea behind this is to generate small temporal data
> chunks from multiple sources and merge them periodically into a big chunk
> with our default (128 MB) block size.
> 
> The web UI for DFS reports the block size for these files to be 6 MB. My
> questions are:
> 1. Can we have multiple files in DFS use different block sizes?
> 2. If we use the default block size for these small chunks, is the DFS
>    space wasted?
>    If not, then does it mean that a single DFS block can hold data from
>    more than one file?
> 
> Thanks
> -Ankur
> 
> 
