hadoop-hdfs-user mailing list archives

From Matthew Foley <ma...@yahoo-inc.com>
Subject Re: Block size in HDFS
Date Fri, 10 Jun 2011 19:05:51 GMT
Pedro,
You need to distinguish between "HDFS" files and blocks, and "Low-level Disk" files
and blocks.

Large HDFS files are broken into HDFS blocks and stored in multiple Datanodes.
On the Datanodes, each HDFS block is stored as a Low-level Disk file.
So if you have the block size set to 64MB, then a 70MB HDFS file would be split into
a 64MB block and a 6MB final block.  Each of those blocks becomes a disk file on one
or more Datanodes (depending on replication settings), and takes up only as much disk
storage as that disk file needs.  Of course we could have just reserved a full 64MB
chunk for each block, but that would have been wasteful.  Instead, HDFS uses only as
much disk as is actually needed for the amount of data in that block.
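A back-of-the-envelope sketch of that arithmetic (plain Python, not Hadoop code; the 64MB figure is the configured block size from this example):

```python
HDFS_BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the configured HDFS block size

def split_into_blocks(file_size):
    """Return the sizes (in bytes) of the HDFS blocks a file occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        # Every block is full-size except possibly the last one.
        blocks.append(min(remaining, HDFS_BLOCK_SIZE))
        remaining -= HDFS_BLOCK_SIZE
    return blocks

# A 70MB file becomes one full 64MB block plus a 6MB final block,
# and the disk space consumed is the sum of the actual block sizes,
# not 2 x 64MB.
sizes = split_into_blocks(70 * 1024 * 1024)
```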

Obviously, only the last block in an HDFS file can be smaller than the block size.

It's worth mentioning that Low-level Disk "blocks" are different.  Because of the way
hard drive hardware works, disk blocks are fixed size, typically either 4KB or 8KB.
It is impossible to allocate less than a full disk block of low-level disk storage.
But this constraint does not apply to HDFS blocks, which are higher-level constructs.
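To make the contrast concrete, here is a minimal sketch (assuming a typical 4KB file system block) of how low-level allocation rounds up, whereas an HDFS block does not:

```python
DISK_BLOCK_SIZE = 4096  # a typical low-level file system block: 4KB

def disk_usage(file_size):
    """Bytes actually allocated on disk: rounded UP to whole disk blocks."""
    if file_size == 0:
        return 0
    # Ceiling division: even 1 byte consumes a full 4KB disk block.
    return -(-file_size // DISK_BLOCK_SIZE) * DISK_BLOCK_SIZE

# A 1KB file consumes one full 4KB disk block on the Datanode's local
# file system -- but nowhere near the 64MB of its HDFS block.
one_kb_on_disk = disk_usage(1024)
```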

--Matt


On Jun 10, 2011, at 9:13 AM, Philip Zeyliger wrote:

On Fri, Jun 10, 2011 at 9:08 AM, Pedro Costa <psdc1978@gmail.com> wrote:
> This means that, when HDFS reads a 1KB file from disk, it will put
> the data in blocks of 64MB?

No.

> 
> On Fri, Jun 10, 2011 at 5:00 PM, Philip Zeyliger <philip@cloudera.com> wrote:
>> On Fri, Jun 10, 2011 at 8:42 AM, Pedro Costa <psdc1978@gmail.com> wrote:
>>> But, how can I say that a 1KB file will only use 1KB of disk space, if
>>> a block is configured as 64MB? In my view, if a 1KB file uses a block of
>>> 64MB, the file will occupy 64MB on the disk.
>> 
>> A block of HDFS is the unit of distribution and replication, not the
>> unit of storage.  HDFS uses the underlying file systems for physical
>> storage.
>> 
>> -- Philip
>> 
>>> 
>>> How can you disassociate a 64MB HDFS data block from a disk block?
>>> 
>>> On Fri, Jun 10, 2011 at 5:01 PM, Marcos Ortiz <mlortiz@uci.cu> wrote:
>>>> On 06/10/2011 10:35 AM, Pedro Costa wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> If I define HDFS to use blocks of 64MB, and I store in HDFS a 1KB
>>>> file, will this file occupy 64MB in HDFS?
>>>> 
>>>> Thanks,
>>>> 
>>>> HDFS is not very efficient at storing small files, because each file is
>>>> stored in a block (of 64MB in your case), and the block metadata
>>>> is held in memory by the NameNode. But you should know that this 1KB file
>>>> will only use 1KB of disk space.
>>>> 
>>>> For small files, you can use Hadoop archives.
>>>> Regards
>>>> 
>>>> --
>>>> Marcos Luís Ortíz Valmaseda
>>>>  Software Engineer (UCI)
>>>>  http://marcosluis2186.posterous.com
>>>>  http://twitter.com/marcosluis2186
>>>> 
>>>> 
>>> 
>> 
> 
> 
> 
> --
> ---------------------------
> Pedro Sá da Costa
> 
> @: pcosta@lasige.di.fc.ul.pt
> @: psdc1978@gmail.com
> 

