hadoop-general mailing list archives

From: Amr Awadallah <...@cloudera.com>
Subject: Re: interface to HDFS
Date: Fri, 22 May 2009 10:24:02 GMT
Mike,

WebDAV should work for you; see:

https://issues.apache.org/jira/browse/HADOOP-496
http://www.hadoop.iponweb.net/Home/hdfs-over-webdav/webdav-server
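
Once the WebDAV server is up, any plain HTTP client can do existence
checks and reads, which keeps everything at the application level. A
minimal Python sketch (the host, port, and path below are placeholders
for whatever your WebDAV server actually exposes):

    import httplib

    # Placeholder host/port -- point these at your hdfs-over-webdav server.
    conn = httplib.HTTPConnection("namenode.example.com", 9800)
    path = "/user/mike/pages/example.com/index.html"

    # Existence check: WebDAV is plain HTTP, so a HEAD request is enough.
    conn.request("HEAD", path)
    resp = conn.getresponse()
    resp.read()  # drain the (empty) body so the connection can be reused
    if resp.status == 200:
        # Retrieval: an ordinary GET returns the file contents.
        conn.request("GET", path)
        html = conn.getresponse().read()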

That said, note that HDFS is not optimized for handling lots of small 
files. It will store them fine, and disk space will not be wasted, but 
the NameNode keeps metadata for every file in memory, so it will not 
scale well to millions of tiny files (the default block size is 64MB, 
and HTML docs are far smaller than that). See this blog post for hints 
on how to work with many small files:

http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
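
One common remedy covered in that post is to pack a crawl's worth of
pages into a single Hadoop Archive, so the NameNode tracks one file
instead of thousands. A rough sketch, assuming the hadoop archive tool
that ships with 0.18+ and made-up paths:

    import subprocess

    # Pack /user/mike/pages into pages.har under /user/mike/archived.
    # This kicks off a MapReduce job and waits for it to finish.
    subprocess.check_call([
        "hadoop", "archive", "-archiveName", "pages.har",
        "-p", "/user/mike", "pages",
        "/user/mike/archived",
    ])

Files inside the archive stay individually addressable through the
har:// scheme, e.g. hadoop fs -cat har:///user/mike/archived/pages.har/pages/...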

Cheers,

-- amr

Mike Anderson wrote:
> Hello, I'm working on a Hadoop project where my data consists of many
> HTML files (websites). One aspect of the project involves traditional
> MapReduce analysis on the data set, but I would also like to use Hadoop as a
> sort of "cache server," i.e., having the ability to retrieve the HTML for a
> website that I have already visited.
>
> My question is this: what is the best way to interact with HDFS to make
> simple existence queries and retrieve specific files for reading? Ideally I
> would like to do this at the application level (most likely in
> Ruby). So far I have explored the option of using one of the FUSE packages
> to mount it in userspace, but I ran into quite a bit of difficulty
> installing either of the two popular packages. My second option seems to be
> Hive, but I haven't been able to find any bindings for Ruby or Python, etc.
>
> Any suggestions or advice would be greatly appreciated!
>
> Cheers,
> Mike
>
