hadoop-common-user mailing list archives

From Jonathan Hendler <hendler...@yahoo.com>
Subject Re: HDFS instead of NFS/NAS/DAS?
Date Wed, 26 Sep 2007 09:28:34 GMT
Hi Dhruba, All,

Thanks for the feedback.
It would be under 14 million files, I would expect.

The read/write question is trickier. It's not a read-only archive - it
functions more like a repository where files are "checked out" and, when
they are checked back in, updated either in part or in whole (updating in
part would be more efficient in terms of network IO, I assume). A given
file would likely be accessed 10-200 times a day, with only 1/10th to
1/100th of the total set being touched in the course of a day.
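
To make that concrete, here's a rough read-side ("check-out") sketch against
the Hadoop Java FileSystem API - the path and offset are just placeholders,
and the seek() is only there to show a partial read:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckOutSketch {
        public static void main(String[] args) throws IOException {
            // Picks up fs.default.name etc. from the Hadoop config on the classpath.
            FileSystem fs = FileSystem.get(new Configuration());

            Path src = new Path("/repository/docs/example.bin"); // placeholder path

            FSDataInputStream in = fs.open(src);
            try {
                in.seek(1024 * 1024);          // partial check-out: start at an offset
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) > 0) {
                    // hand buf[0..n) to whatever does the check-out
                }
            } finally {
                in.close();
            }
        }
    }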

So it sounds like I could change the default block size to 1MB and write a
MapReduce job that simply reads/writes the files. I assume each block is
replicated across a few machines.
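
For the write side, a minimal sketch against the same API (whether it runs
inside a map task or a plain client program) - the path, buffer size, and
3-way replication are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckInSketch {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());

            Path dst = new Path("/repository/docs/example.bin"); // placeholder path

            // create(path, overwrite, bufferSize, replication, blockSize):
            // 1MB blocks, each block replicated on 3 datanodes.
            FSDataOutputStream out =
                fs.create(dst, true, 64 * 1024, (short) 3, 1024 * 1024);
            try {
                out.write("file contents go here".getBytes());
            } finally {
                out.close();
            }
        }
    }

(The block size could also be set cluster-wide via dfs.block.size in
hadoop-site.xml instead of per file.)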

Is there any example code you're aware of for using HDFS this way? [1] Or
maybe HDFS isn't designed for this task.


[1] http://wiki.apache.org/lucene-hadoop/LibHDFS points to your code - so
would my simple use case be a C application, or was that code simply a
single-machine test?

Also, I'd be interested if anyone has comparative thoughts on
http://www.danga.com/mogilefs/ ?

Dhruba Borthakur wrote:
> Hi Jonathan,
> Thanks for asking this question. I think all four of your requirements are
> satisfied by HDFS. The one issue I have is that HDFS is not designed to
> support a large number of small files, but rather a smaller number of larger
> files. For example, the default block size is 64MB (it is configurable).
> That said, version 0.13 has been tested to store about 14 million files.
> Version 0.15 (to be released in October) should support about 4 times that
> number. This limit will probably keep increasing, but you should take it
> into account when evaluating HDFS.
> May I ask how many files you might have? How does the number of files grow
> over time? How frequently are files accessed? Are you going to use HDFS as a
> read-only archival system?
> Thanks,
> dhruba
> -----Original Message-----
> From: Jonathan Hendler [mailto:hendlerman@yahoo.com] 
> Sent: Thursday, September 20, 2007 8:25 PM
> To: hadoop-user@lucene.apache.org
> Subject: HDFS instead of NFS/NAS/DAS?
> Hi All,
> I am a complete newbie to Hadoop - I haven't installed or tested it yet,
> but I've been reading up in my spare time for about a month now and
> following the list. I think it's really exciting to provide this kind of
> infrastructure as open source!
> I'll provide some context for the subject of this email; although I've
> seen a thread or two about storing many small files in Hadoop, I'm not
> sure they address the following.
> Goals:
>    1. Many small files (from 1MB to 2GB)
>    2. Automated "fail-safe" redundancy
>    3. Automated synchronization of the redundancy
>    4. Predictable speed for read/write of these files (in part or whole)
>       as load / server count increases
> The middleware having access to the files could be used, among other
> things, to:
>    1. track "where the files are", and their states
>    2. sync differences 
> My thinking is that by splitting these files into parts, even small ones,
> across a number of machines, CRUD will be faster than NFS, as well as
> "safer". I'm also thinking that using HDFS would be cheaper than DAS and
> more feature-rich than NAS [1]. In addition, it wouldn't matter "where" the
> files live in HDFS, which would reduce the complexity of the middleware.
> I've also read that DHTs generally don't have intelligent load balancing,
> making HDFS-type schemes more consistent.
> Since Hadoop is primarily designed to move the computation to where the
> data is, does it make sense to use HDFS in this way? [2]
> - Jonathan
> [1] - http://en.wikipedia.org/wiki/Network-attached_storage#Drawbacks
> [2] - (assuming the memory limit in the master isn't reached because of a
> large number of files/blocks)
