hadoop-common-user mailing list archives

From "Dhruba Borthakur" <dhr...@yahoo-inc.com>
Subject RE: HDFS instead of NFS/NAS/DAS?
Date Wed, 26 Sep 2007 22:12:07 GMT
HDFS should work well for your case. I would go with the default block size,
even though your files are quite small.
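
For reference, block size is a per-file parameter at file-creation time. A
rough, untested sketch against the libhdfs C API (made-up path; 0 for the
size/replication arguments keeps the cluster defaults, which is what I am
recommending):

    #include "hdfs.h"     /* libhdfs header */
    #include <fcntl.h>

    int main(void) {
        /* "default" picks up the namenode from the client's Hadoop config. */
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) return 1;

        /* bufferSize, replication, blockSize: 0 means "use the configured
           default"; a nonzero blockSize (e.g. 1048576) would override it
           for this one file only. */
        hdfsFile f = hdfsOpenFile(fs, "/tmp/example.dat", O_WRONLY | O_CREAT,
                                  0, 0, 0);
        if (f) hdfsCloseFile(fs, f);
        hdfsDisconnect(fs);
        return 0;
    }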

Thanks,
dhruba

-----Original Message-----
From: Jonathan Hendler [mailto:hendlerman@yahoo.com] 
Sent: Wednesday, September 26, 2007 2:29 AM
To: hadoop-user@lucene.apache.org
Cc: Dhruba Borthakur
Subject: Re: HDFS instead of NFS/NAS/DAS?

Hi Dhruba, All,

Thanks for the feedback.
It would be under 14 million files, I would expect.

The read/write question is trickier. It's not a read-only archive; it
functions more as a repository where files are "checked out", and when they
are checked back in they will be updated either in part or in whole
(updating in part would be more efficient in terms of network I/O, I
assume). Files would likely be accessed 10-200 times a day, with only
1/10 to 1/100 of the total being accessed over the course of a day.

So it sounds like I could change the default block size to 1MB and
write a MapReduce job that simply reads/writes the file. I assume each block
is replicated across a few machines.
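
(As a sanity check on my own understanding: if a plain write doesn't actually
need a MapReduce job, I imagine the client API alone would look roughly like
the untested sketch below, using the libhdfs C bindings from [1] and a
made-up path.)

    #include "hdfs.h"
    #include <fcntl.h>
    #include <string.h>

    int main(void) {
        /* Connect to the namenode named in the client's Hadoop config. */
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) return 1;

        const char *path = "/repo/checkin/example.dat";  /* made-up path */
        const char *data = "file contents being checked in";

        /* Create the file; the trailing 0s keep the configured defaults for
           buffer size, replication and block size. */
        hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY | O_CREAT, 0, 0, 0);
        if (!out) return 1;

        hdfsWrite(fs, out, (void *)data, (tSize)strlen(data));
        hdfsFlush(fs, out);
        hdfsCloseFile(fs, out);
        hdfsDisconnect(fs);
        return 0;
    }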

Is there any example code that you are aware of for using HDFS for this
purpose? [1] Or maybe HDFS isn't designed for this task.

Best,
Jonathan

[1] http://wiki.apache.org/lucene-hadoop/LibHDFS points to your code for
libhdfs,
http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/c%2B%2B/libhdfs/hdfs_test.c?view=markup,
so would my simple use case be a C application, or was this code simply a
single-machine test?
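
From skimming hdfs_test.c, the read side looks like it would be roughly the
following (again an untested sketch with a made-up path), so I'm mostly
asking whether this is the intended way to use it from an application:

    #include "hdfs.h"
    #include <fcntl.h>
    #include <stdio.h>

    int main(void) {
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) return 1;

        const char *path = "/repo/checkout/example.dat";  /* made-up path */
        char buf[4096];

        /* Open an existing file read-only; 0s keep the configured defaults. */
        hdfsFile in = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
        if (!in) return 1;

        tSize n = hdfsRead(fs, in, buf, (tSize)sizeof(buf));
        printf("read %d bytes\n", (int)n);

        hdfsCloseFile(fs, in);
        hdfsDisconnect(fs);
        return 0;
    }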


Also, I'd be interested if anyone has comparative thoughts on
http://www.danga.com/mogilefs/ ?

Dhruba Borthakur wrote:
> Hi Jonathan,
>
> Thanks for asking this question. I think all four of your requirements are
> satisfied by HDFS. The one issue I have is that HDFS is not designed to
> support a large number of small files; it favors a smaller number of larger
> files. For example, the default block size is 64MB (it is configurable).
>
> That said, version 0.13 has been tested to store about 14 million files.
> Version 0.15 (to be released in October) should support about 4 times that
> number. This limit will probably increase with every passing week, but you
> should keep this limitation in mind when evaluating HDFS.
>
> May I ask how many files you might have? How does the number of files grow
> over time? How frequently are files accessed? Are you going to use HDFS as a
> read-only archival system?
>
> Thanks,
> dhruba
>
> -----Original Message-----
> From: Jonathan Hendler [mailto:hendlerman@yahoo.com] 
> Sent: Thursday, September 20, 2007 8:25 PM
> To: hadoop-user@lucene.apache.org
> Subject: HDFS instead of NFS/NAS/DAS?
>
> Hi All,
>
> I am a complete newbie to Hadoop, not having tested or installed it yet,
> but I have been reading up for about a month now in my spare time and
> following the list. I think it's really exciting to provide this kind of
> infrastructure as open source!
>
> I'll provide some context for the subject of this email. Although I've
> seen a thread or two about storing many small files in Hadoop, I'm not
> sure they address the following.
>
> Goal:
>
>    1. Many small files (from 1MB to 2GB)
>    2. Automated "fail-safe" redundancy
>    3. Automated synchronization of the redundancy
>    4. Predictable speed as load / server count increases for read/write
>       of these files (in part or whole)
>
> The middleware having access to the files could be used, among other
> things, to:
>
>    1. track "where the files are", and their states
>    2. sync differences 
>
> My thinking is that by splitting these files into parts, even if they are
> small, across a number of machines, CRUD will be faster than NFS, as well
> as "safer". I'm also thinking that using HDFS would be cheaper than DAS
> and more feature-rich than NAS [1]. It also wouldn't matter "where" the
> files were in HDFS, which would reduce the complexity of the
> middleware. I also read that DHTs generally don't have intelligent load
> balancing, making HDFS-type schemes more consistent.
>
> Since Hadoop is primarily designed to move the computation to where the
> data is, does it make sense to use HDFS in this way? [2]
>
> - Jonathan
>
> [1] - http://en.wikipedia.org/wiki/Network-attached_storage#Drawbacks
> [2] - (assuming the memory limit in the master isn't reached because of the
> large number of files/blocks)