hadoop-general mailing list archives

From "Mark Laffoon" <mlaff...@semanticresearch.com>
Subject RE: web-based file transfer
Date Fri, 05 Nov 2010 21:34:30 GMT

What are the performance characteristics of the WebDAV solution? I ask
for two reasons: since WebDAV runs over HTTP, it probably isn't much
faster than a plain HTTP file upload; and in my previous experience with
WebDAV (on top of object stores) we found the protocol to be very "chatty".

Since our use case is fairly simple (we just need to transfer lots of files
from lots of clients; navigating the results isn't necessary), will the
WebDAV solution be overkill?
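
For a transfer-only use case like this, a WebDAV upload comes down to one
HTTP PUT per file; the extra round trips (PROPFIND and friends) mostly come
from clients that browse or mount the share. Below is a minimal sketch of
such an upload using only the Python standard library. The function name
webdav_put, the base URL, and the remote path are hypothetical, and
authentication is left out. It assumes the parent collection already exists
on the server.

```python
import http.client
from urllib.parse import urlparse


def webdav_put(base_url, remote_path, data, timeout=30):
    """Upload bytes with a single HTTP PUT and return the HTTP status code.

    base_url and remote_path are placeholders for whatever the WebDAV
    server actually exposes; no authentication is attempted here.
    """
    parts = urlparse(base_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        # Join the base path and the target path with exactly one slash.
        path = parts.path.rstrip("/") + "/" + remote_path.lstrip("/")
        conn.request(
            "PUT",
            path,
            body=data,
            headers={"Content-Type": "application/octet-stream"},
        )
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status
    finally:
        conn.close()
```

A server normally answers a successful PUT with 201 (Created) or 204 (No
Content), so a client that only uploads can treat any other status as a
failure and retry or report it.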



-----Original Message-----
From: Gibbon, Robert, VF-Group [mailto:Robert.Gibbon@vodafone.com] 
Sent: Wednesday, November 03, 2010 4:20 PM
To: general@hadoop.apache.org; general@hadoop.apache.org
Subject: RE: web-based file transfer

Check out HDFS over WebDAV

- http://www.hadoop.iponweb.net/Home/hdfs-over-webdav

WebDAV is an HTTP based protocol for accessing remote filesystems.

I'm running an adapted version of this. It runs under Jetty, which is
pretty much industry standard, and is built on Apache Jackrabbit, which is
pretty production stable too. I lashed together a custom JAAS authentication
module to authenticate it against our user database.

You can mount WebDAV on Linux using FUSE and WDFS, or script sessions with
cadaver on Solaris/Unix without mounting WebDAV. It works pretty sweet on
Windows and Apple, too.

Recent versions of Jetty have built in traffic shaping and QoS features,
although you might get more mileage from HAProxy or a hardware

It works pretty sweet as it enforces HDFS permissions (if you have them
enabled). To get Hadoop permission integrity enforced on MapReduce jobs
check out Oozie - it's a job submission proxy which runs under Tomcat
(might work with Jetty too - haven't tried) and can use a custom
ServletFilter for authentication which you can also patch onto your own
user database/directory. 

Then you just need to seal the perimeter of your cluster with firewall
rules and you're good to go.

No more Kerberos!

-----Original Message-----
From: Eric Sammer [mailto:esammer@cloudera.com]
Sent: Wed 11/3/2010 5:05 PM
To: general@hadoop.apache.org
Subject: Re: web-based file transfer

Something like it, but Chukwa is more similar to Flume. For *files*
one may want something slightly different. For a stream of (data)
events, Chukwa, Flume, or Scribe are appropriate.

On Wed, Nov 3, 2010 at 1:22 AM, Ian Holsman <hadoop@holsman.net> wrote:
> Doesn't chukwa do something like this?
> ---
> Ian Holsman - 703 879-3128
> I saw the angel in the marble and carved until I set him free --
> On 03/11/2010, at 5:44 AM, Eric Sammer <esammer@cloudera.com> wrote:
>> I would recommend against clients pushing data directly to hdfs like
>> this for a few reasons.
>> 1. The HDFS cluster would need to be directly exposed to a public
>> network; you don't want to do this.
>> 2. You'd be applying (presumably) a high concurrent load to HDFS which
>> isn't its strong point.
>> From an architecture point of view, it's much nicer to have a queuing
>> system between the upload and ingestion into HDFS that you can
>> throttle and control, if necessary. This also allows you to isolate
>> the cluster from the outside world. So as not to bottleneck on a single
>> writer, you can have uploaded files land in a queue and have multiple
>> competing consumers popping files (or file names upon which to
>> operate) out of the queue and handling the writing in parallel while
>> being able to control the number of workers. If the initial upload is
>> to a shared device like NFS, you can have writers live on multiple
>> boxes and distribute the work.
>> Another option is to consider Flume, but only if you can deal with the
>> fact that it effectively throws away the notion of files and treats
>> their contents as individual events, etc.
>> http://github.com/cloudera/flume.
>> Hope that helps.
>> On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
>> <mlaffoon@semanticresearch.com> wrote:
>>> We want to enable our web-based client (i.e. browser client, java
>>> whatever?) to transfer files into a system backed by hdfs. The obvious
>>> simple solution is to do http file uploads, then copy the file to hdfs. I
>>> was wondering if there is a way to do it with an hdfs-enabled applet
>>> the server gives the client the necessary hadoop configuration
>>> information, and the client applet pushes the data directly into hdfs.
>>> Has anybody done this or something similar? Can you give me a starting
>>> point (I'm about to go wander through the hadoop CLI code to get
>>> Thanks,
>>> Mark
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com

Eric Sammer
twitter: esammer
data: www.cloudera.com
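
The queue-and-workers layout Eric describes in the quoted message above
(uploads land in a queue, and a bounded pool of competing consumers writes
them into HDFS) can be sketched roughly as follows. This is an illustration
under stated assumptions, not anyone's actual setup: the name ingest and its
parameters are hypothetical, and copy_fn is a stand-in for the real HDFS
write, defaulting to a local file copy so the sketch runs anywhere.

```python
import queue
import shutil
import threading
from pathlib import Path


def ingest(upload_queue, dest_dir, num_workers=4, copy_fn=shutil.copy):
    """Drain upload_queue with num_workers competing consumers.

    copy_fn(src, dest_dir) is a placeholder for the real HDFS write.
    Only items already queued when this is called are processed.
    Returns the sorted names of the files that were handled.
    """
    handled = []
    handled_lock = threading.Lock()

    def worker():
        while True:
            path = upload_queue.get()
            try:
                if path is None:  # sentinel: no more work for this worker
                    return
                copy_fn(str(path), str(dest_dir))
                with handled_lock:
                    handled.append(Path(path).name)
            finally:
                upload_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    # One sentinel per worker, queued behind the pending files.
    for _ in threads:
        upload_queue.put(None)
    for t in threads:
        t.join()
    return sorted(handled)
```

The num_workers argument is the throttle: it caps how many writes hit the
cluster at once, no matter how many clients are uploading. With a shared
landing area such as NFS, the same loop could run on several boxes against a
shared queue.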
