hadoop-general mailing list archives

From "Gibbon, Robert, VF-Group" <Robert.Gib...@vodafone.com>
Subject RE: web-based file transfer
Date Wed, 03 Nov 2010 23:19:32 GMT
Check out HDFS over WebDAV

- http://www.hadoop.iponweb.net/Home/hdfs-over-webdav

WebDAV is an HTTP based protocol for accessing remote filesystems.

I'm running an adapted version of this. It runs under Jetty, which is pretty much an industry
standard, and is built on Apache Jackrabbit, which is pretty production-stable too. I lashed
together a custom JAAS authentication module to authenticate it against our user database.

You can mount WebDAV on Linux using FUSE and WDFS, or script sessions with cadaver on Solaris/Unix
without mounting WebDAV. It works pretty well on Windows and Apple, too.
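
Since WebDAV is HTTP underneath, a scripted upload without mounting anything comes down to
an HTTP PUT. Here is a minimal sketch of that in Python using only the standard library. The
host, port, and path are made-up placeholders, not the endpoint of the setup described in this
message, and authentication is left out entirely.

```python
import urllib.request


def build_webdav_put(base_url: str, remote_path: str, data: bytes) -> urllib.request.Request:
    """Build (but do not send) an HTTP PUT that would upload `data` to a WebDAV path."""
    # Join base URL and path with exactly one slash between them.
    url = base_url.rstrip("/") + "/" + remote_path.lstrip("/")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


# Hypothetical endpoint and path, for illustration only.
req = build_webdav_put("http://webdav.example.com:8080/", "/user/mark/upload.bin", b"hello")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen(req) would actually send it. A real deployment
would also need credentials that match whatever authentication module sits in front of the
WebDAV server.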

Recent versions of Jetty have built-in traffic shaping and QoS features, although you might
get more mileage from HAProxy or a hardware load balancer.

It works well because it enforces HDFS permissions (if you have them enabled). To get Hadoop
permission integrity enforced on MapReduce jobs, check out Oozie - it's a job submission proxy
which runs under Tomcat (might work with Jetty too - haven't tried) and can use a custom servlet
filter for authentication, which you can also hook up to your own user database/directory.

Then you just need to seal the perimeter of your cluster with firewall rules and you're good
to go.

No more Kerberos!

-----Original Message-----
From: Eric Sammer [mailto:esammer@cloudera.com]
Sent: Wed 11/3/2010 5:05 PM
To: general@hadoop.apache.org
Subject: Re: web-based file transfer
Something like it, but Chukwa is more similar to Flume. For *files*
one may want something slightly different. For a stream of (data)
events, Chukwa, Flume, or Scribe are appropriate.

On Wed, Nov 3, 2010 at 1:22 AM, Ian Holsman <hadoop@holsman.net> wrote:
> Doesn't chukwa do something like this?
> ---
> Ian Holsman - 703 879-3128
> I saw the angel in the marble and carved until I set him free -- Michelangelo
> On 03/11/2010, at 5:44 AM, Eric Sammer <esammer@cloudera.com> wrote:
>> I would recommend against clients pushing data directly to hdfs like
>> this for a few reasons.
>> 1. The HDFS cluster would need to be directly exposed to a public
>> network; you don't want to do this.
>> 2. You'd be applying (presumably) a high concurrent load to HDFS which
>> isn't its strong point.
>> From an architecture point of view, it's much nicer to have a queuing
>> system between the upload and ingestion into HDFS that you can
>> throttle and control, if necessary. This also allows you to isolate
>> the cluster from the outside world. So as not to bottleneck on a single
>> writer, you can have uploaded files land in a queue and have multiple
>> competing consumers popping files (or file names upon which to
>> operate) out of the queue and handling the writing in parallel while
>> being able to control the number of workers. If the initial upload is
>> to a shared device like NFS, you can have writers live on multiple
>> boxes and distribute the work.
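
The queue-with-competing-consumers idea above can be sketched roughly as follows, in Python
with only the standard library. This is an illustration under assumptions, not a tested
ingestion pipeline: write_to_hdfs is a hypothetical stand-in that just copies to a local
directory, and the landing directory plays the role of the shared upload area.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


def write_to_hdfs(src: Path, dest_dir: Path) -> None:
    # Hypothetical stand-in: a real worker would write the file into HDFS here.
    shutil.copy(src, dest_dir / src.name)


def ingest(files, dest_dir: Path, num_workers: int = 3):
    """Drain a queue of uploaded files with a fixed number of competing workers."""
    work = queue.Queue()
    for f in files:
        work.put(f)

    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                src = work.get_nowait()  # competing consumers pop from one queue
            except queue.Empty:
                return  # nothing left to do
            write_to_hdfs(src, dest_dir)
            with lock:
                done.append(src.name)

    # num_workers is the throttle: it caps concurrent writers hitting the cluster.
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)


# Simulate five uploaded files landing in a shared directory.
landing = Path(tempfile.mkdtemp())
target = Path(tempfile.mkdtemp())
for i in range(5):
    (landing / f"upload-{i}.dat").write_text(f"payload {i}")

ingested = ingest(sorted(landing.iterdir()), target, num_workers=2)
print(ingested)
```

The point of the design is that the number of workers, not the number of uploading clients,
determines the concurrent load on HDFS. With workers spread over several boxes, the in-process
queue here would be replaced by a shared one.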
>> Another option is to consider Flume, but only if you can deal with the
>> fact that it effectively throws away the notion of files and treats
>> their contents as individual events, etc.
>> http://github.com/cloudera/flume.
>> Hope that helps.
>> On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
>> <mlaffoon@semanticresearch.com> wrote:
>>> We want to enable our web-based client (e.g. browser client, Java applet,
>>> whatever) to transfer files into a system backed by HDFS. The obvious
>>> simple solution is to do HTTP file uploads, then copy the file to HDFS. I
>>> was wondering if there is a way to do it with an HDFS-enabled applet where
>>> the server gives the client the necessary Hadoop configuration
>>> information, and the client applet pushes the data directly into HDFS.
>>> Has anybody done this or something similar? Can you give me a starting
>>> point? (I'm about to go wander through the Hadoop CLI code to get ideas.)
>>> Thanks,
>>> Mark
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com

Eric Sammer
twitter: esammer
data: www.cloudera.com
