hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: Mounting HDFS as local file system
Date Thu, 02 Dec 2010 15:02:14 GMT

On Dec 2, 2010, at 8:52 AM, Mark Kerzner wrote:

> Thank you, Brian.
> 
> I found your paper "Using Hadoop as grid storage," and it was very useful.
> 
> One thing I did not understand in it is your file usage pattern - do you
> deal with small or large files, and do you delete them often enough? My
> question was, in part, can you use HDFS as a regular file system with
> frequent file deletes? Does it not become fragmented and unreliable?
> 

We don't have any fragmentation issues.  We frequently delete files (we're expected to be
able to turn over 500TB in 2 weeks).  We use quotas and daily monitoring to watch for
users who abuse the system.  The only directories without quotas are the ones we populate
centrally; user directories (whose contents we don't control) can quite easily grow to 1-20TB,
but users have to provide a strong justification to get more than 10k files.
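A rough sketch of the quota setup described above, using the standard `hadoop dfsadmin` quota commands (the directory path and limits here are illustrative, not our actual values):

```shell
# Namespace quota: cap the directory at 10k files/directories total,
# so small-file abuse requires an explicit justification to lift.
hadoop dfsadmin -setQuota 10000 /user/alice

# Space quota: let the directory grow to 20 TB.
# (Space quotas count raw bytes, i.e. including replication.)
hadoop dfsadmin -setSpaceQuota 20t /user/alice

# Daily monitoring can then read usage back against the quotas:
hadoop fs -count -q /user/alice
```

These commands need a running HDFS cluster and superuser privileges, so they are shown as a fragment rather than a runnable script.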

Because HDFS has limited write semantics (but close-enough-to-POSIX read semantics), our users
love it, but they understand it's "special".

It's been a matter of user training:
- Do you want high performance storage that can do lots of small files?  If so, the cost is
$X / TB.
- Do you want high throughput storage where you have limited write semantics and need to use
large files?  If so, the cost is $Y / TB.
X is roughly 5-10x Y, so the group leaders can budget appropriately.  We then purchase Hadoop
and our Other Storage System in appropriate amounts.

User education goes a long way.  However, if they don't want to be bothered to be educated,
they can always pay more money.

Brian

> Thank you,
> Mark
> 
> On Thu, Dec 2, 2010 at 7:10 AM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> 
>> 
>> On Dec 2, 2010, at 5:16 AM, Steve Loughran wrote:
>> 
>>> On 02/12/10 03:01, Mark Kerzner wrote:
>>>> Hi, guys,
>>>> 
>>>> I see that there is MountableHDFS<
>> http://wiki.apache.org/hadoop/MountableHDFS>,
>>>> and I know that it works, but my questions are as follows:
>>>> 
>>>>   - How reliable is it for large storage?
>>> 
>>> Shouldn't be any worse than normal HDFS operations.
>>> 
>>>>   - Is it not hiding the regular design questions - we are dealing with
>>>>   NameServers after all, but are trying to use it as a regular file
>> system?
>>>>   - For example, HDFS is not optimized for many small files that get
>>>>   written and deleted, but a mounted system will lure one in this
>> direction.
>>> 
>>> Like you say, it's not a conventional POSIX fs; it hates small files,
>> where other systems may be better.
>> 
>> I would comment that it's extremely reliable.  There's at least one slow
>> memory leak in fuse-dfs that I haven't been able to squash, and I typically
>> remount things after a month or two of *heavy* usage.
>> 
>> Across all the nodes in our cluster, we probably do a few billion HDFS
>> operations per day over FUSE.
>> 
>> Brian
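The fuse-dfs usage discussed in the quoted message might look roughly like the following, assuming the contrib fuse-dfs build from the Hadoop tree and an illustrative namenode address:

```shell
# Mount HDFS at /mnt/hdfs via fuse-dfs
# (the namenode host and port are assumptions for illustration).
mkdir -p /mnt/hdfs
fuse_dfs_wrapper.sh dfs://namenode.example.org:8020 /mnt/hdfs

# Once mounted, ordinary POSIX reads go through the kernel:
ls /mnt/hdfs/user
cat /mnt/hdfs/user/alice/part-00000

# The periodic remount mentioned above, to work around the slow
# memory leak in fuse-dfs:
umount /mnt/hdfs
fuse_dfs_wrapper.sh dfs://namenode.example.org:8020 /mnt/hdfs
```

This is a fragment rather than a runnable script: it requires a live cluster, the FUSE kernel module, and root (or user-mount) privileges.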
