hadoop-hdfs-user mailing list archives

From Allen Wittenauer <awittena...@linkedin.com>
Subject Re: HDFS questions
Date Fri, 20 Nov 2009 18:39:04 GMT
On 11/19/09 5:44 PM, "yiqi.tang@barclayscapital.com"
<yiqi.tang@barclayscapital.com> wrote:
> 1. It looks to be a user-space clustering file system, which means root
> access is never needed during installation and on-going usage, right?

It is a user-space file system; however, the user running the
NameNode process is treated as 'root' within HDFS, not the UNIX-level
root.  It should also be pointed out that HDFS is not secure, so tricking
the system into giving anyone root privileges within HDFS is beyond
trivial.  That makes using HDFS for SOX data a very difficult business.

> 2. It looks to have a software raid scheme, and the default is 3 copies,
> so if I have 3 boxes with 500gb disks each, the usable space is just
> 500gb right?

~500gb, yes, if you ignore the disk space the MapReduce component also
needs.
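As a back-of-the-envelope sketch of that arithmetic (figures taken from the question above; HDFS's default replication factor is 3):

```python
# Rough HDFS capacity estimate: raw disk divided by the replication
# factor. Figures come from the question above (3 boxes x 500gb,
# default replication of 3).
boxes = 3
disk_per_box_gb = 500
replication = 3

raw_gb = boxes * disk_per_box_gb      # 1500 gb of raw disk
usable_gb = raw_gb / replication      # ~500 gb of usable HDFS space
print(usable_gb)                      # 500.0
```

This ignores the space MapReduce needs for intermediate output, as noted above.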

> 3. If I just have 1 box of 500gb disk, what is the usable space?

It depends upon the configuration.  If you run three different DataNode
processes pointing to three different directories with a replication factor
of 3, then it is ~166gb.  If you configure one DataNode process with a
replication factor of 1, then you get ~500gb.

Running 'real' workloads on one machine is not recommended however.
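For the one-DataNode, replication-factor-1 case, the relevant setting is the standard `dfs.replication` property in hdfs-site.xml (shown here as a sketch, not a complete configuration):

```xml
<!-- hdfs-site.xml: run with no replication, so the
     full ~500gb of the single disk is usable -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```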

> 4. Does the HDFS show up in Linux OS as multiple physical files? Of what
> size? If I change a file of 1 byte inside HDFS, I assume the entire
> physical file on Linux OS is changed right? This will really mess with
> tape backups...

A) HDFS is fairly WORM (write once, read many):  there is no random write
IO, so you aren't going to be changing 1 byte.  You will be rewriting the
whole file.

B) Files at the UNIX level are written up to the block size.  So if the
block size is 128MB and you write a 512MB file in HDFS, each replica is
stored as four 128MB block files, i.e. 3x4 = 12 block files at the UNIX
level across the cluster.

C) I don't think anyone does tape backups of HDFS.  The expectation is that
the source of truth lives elsewhere and gets backed up there, or that you
have two or more grids holding copies of each other's data.
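The block arithmetic in (B) can be sketched as follows (numbers from the example above):

```python
import math

# How a 512MB HDFS file lands on the DataNodes' UNIX file systems,
# using the example figures above: 128MB blocks, replication factor 3.
file_mb = 512
block_mb = 128
replication = 3

blocks = math.ceil(file_mb / block_mb)   # 4 block files per replica
unix_files = blocks * replication        # 3 x 4 = 12 block files cluster-wide
print(blocks, unix_files)                # 4 12
```

A file that isn't an exact multiple of the block size simply gets a final, smaller block file, which is why the ceiling is taken.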

I'd recommend going through some of the operations presentations on
http://wiki.apache.org/hadoop/HadoopPresentations .  It might help answer
some of the how's and why's. :)
