hadoop-hdfs-user mailing list archives

From <yiqi.t...@barclayscapital.com>
Subject RE: HDFS questions
Date Fri, 20 Nov 2009 19:19:55 GMT
Thanks for the replies.

-----Original Message-----
From: Allen Wittenauer [mailto:awittenauer@linkedin.com] 
Sent: Friday, November 20, 2009 1:39 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS questions

On 11/19/09 5:44 PM, "yiqi.tang@barclayscapital.com"
<yiqi.tang@barclayscapital.com> wrote:
> 1. It looks to be a user-space clustering file system, which means 
> root access is never needed during installation and on-going usage,

It is a user-space file system; however, the user running the
NameNode process is treated as 'root' within HDFS, not the UNIX-level
root.  It should also be pointed out that HDFS is not secure, so
tricking the system into giving anyone root privileges within HDFS is
beyond trivial.  Using HDFS for SOX data is a very difficult business.

> 2. It looks to have a software raid scheme, and the default is 3 
> copies, so if I have 3 boxes with 500gb disks each, the usable space 
> is just 500gb right?

~500gb, yes, if you ignore the MapReduce component.

> 3. If I just have 1 box of 500gb disk, what is the usable space?

It depends upon the configuration.  If you run three different DataNode
processes pointing to three different directories with a replication
factor of 3, then it is ~166gb.  If you configure one DataNode process
with a replication factor of 1, then you get ~500gb.
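The capacity arithmetic above can be sketched as follows (a minimal
illustration of the division, not actual HDFS code; the function name is
my own):

```python
def usable_space_gb(raw_disk_gb, replication_factor):
    """Approximate usable HDFS capacity: raw capacity divided by
    the number of copies kept of every block."""
    return raw_disk_gb / replication_factor

# One 500gb disk, three DataNode processes, replication factor 3:
print(usable_space_gb(500, 3))  # ~166gb

# One 500gb disk, one DataNode process, replication factor 1:
print(usable_space_gb(500, 1))  # ~500gb
```

This ignores overhead such as MapReduce scratch space and metadata, so
real usable space will be somewhat lower.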

Running 'real' workloads on one machine is not recommended however.

> 4. Does the HDFS show up in Linux OS as multiple physical files?


> Of what size? If I change a file of 1 byte inside HDFS, I assume the 
> entire physical file on Linux OS is changed right? This will really 
> mess with tape backups...

A) HDFS is fairly WORM (write once, read many):  there is no random IO,
so you aren't going to be changing 1 byte.  You will be rewriting the
whole file.

B) Files at the UNIX level are written up to the block size.  So if the
block size is 128MB and you write a 512MB file into HDFS, you'll have 12
(4 blocks x 3 replicas) 128MB files at the UNIX level.
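That block arithmetic can be sketched like so (illustrative only;
real HDFS block placement and naming are more involved, and the
function name is my own):

```python
import math

def unix_level_block_files(file_size_mb, block_size_mb=128, replication=3):
    """Number of block files stored across DataNodes: the file is
    split into block-size chunks, and each chunk is replicated."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks * replication

# A 512MB file with 128MB blocks and replication 3:
print(unix_level_block_files(512))  # 4 blocks x 3 replicas = 12 files
```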

C) I don't think anyone does tape backups of HDFS.  The expectation is
that the source of truth is elsewhere and that is getting backed up, or
that you have two+ grids that have copies of each other's data.

I'd recommend going through some of the operations presentations on
http://wiki.apache.org/hadoop/HadoopPresentations .  It might help
answer some of the how's and why's. :)


