hadoop-general mailing list archives

From "Gibbon, Robert, VF-Group" <Robert.Gib...@vodafone.com>
Subject RE: web-based file transfer
Date Fri, 05 Nov 2010 23:19:09 GMT


> What are the performance characteristics like for the webdav solution?

The HDFS over WebDAV setup is horizontally scalable: just keep adding Jettys and put a round-robin
VIP on the front. It is stateless, so there's no need for sticky sessions.

It is not especially chatty unless you're doing complex directory traversals - uploads and downloads
are plain HTTP PUT and GET - much the same as most REST implementations, in fact.
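To give a feel for it, a single upload/download round trip against the gateway looks roughly like
the sketch below, using nothing but java.net.HttpURLConnection. The host, port and path are made
up for illustration, not our actual endpoint.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class WebDavRoundTrip {
    public static void main(String[] args) throws Exception {
        // Hypothetical WebDAV gateway VIP fronting HDFS
        URL url = new URL("http://webdav-vip.example.com:8080/user/rob/data.csv");

        // Upload: one HTTP PUT carrying the file body
        HttpURLConnection put = (HttpURLConnection) url.openConnection();
        put.setDoOutput(true);
        put.setRequestMethod("PUT");
        try (OutputStream out = put.getOutputStream()) {
            Files.copy(Paths.get("data.csv"), out);
        }
        System.out.println("PUT status: " + put.getResponseCode());

        // Download: one HTTP GET streaming the file back
        HttpURLConnection get = (HttpURLConnection) url.openConnection();
        try (InputStream in = get.getInputStream()) {
            Files.copy(in, Paths.get("data-copy.csv"), StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println("GET status: " + get.getResponseCode());
    }
}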

For us, it's more than good enough.



-----Original Message-----
From: Mark Laffoon [mailto:mlaffoon@semanticresearch.com]
Sent: Fri 11/5/2010 10:34 PM
To: general@hadoop.apache.org
Subject: RE: web-based file transfer
 
Robert,

What are the performance characteristics like for the WebDAV solution? I
ask for two reasons: since it runs over plain HTTP it probably isn't much
faster than an HTTP file upload, and in previous experience with WebDAV
(on top of object stores) we found the protocol to be very "chatty".

Since our use case is fairly simple (we just need to transfer lots of files
from lots of clients; navigating the results isn't necessary), would the
WebDAV solution be overkill?

Comments?

Thanks!

-----Original Message-----
From: Gibbon, Robert, VF-Group [mailto:Robert.Gibbon@vodafone.com] 
Sent: Wednesday, November 03, 2010 4:20 PM
To: general@hadoop.apache.org; general@hadoop.apache.org
Subject: RE: web-based file transfer

Check out HDFS over WebDAV

- http://www.hadoop.iponweb.net/Home/hdfs-over-webdav

WebDAV is an HTTP-based protocol for accessing remote filesystems.

I'm running an adapted version of this. It runs under Jetty, which is
pretty much industry standard, and is built on Apache JackRabbit, which is
pretty production-stable too. I lashed together a custom JAAS authentication
module to authenticate it against our user database.
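For anyone wanting to do the same, a bare-bones JAAS module has roughly the
shape below. Sketch only - the hard-coded credential check stands in for
whatever user database lookup you actually have.

import java.util.Map;
import javax.security.auth.Subject;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.auth.login.LoginException;
import javax.security.auth.spi.LoginModule;

// Minimal sketch of a JAAS LoginModule checking credentials itself.
public class UserDbLoginModule implements LoginModule {

    private CallbackHandler handler;
    private boolean succeeded;

    public void initialize(Subject subject, CallbackHandler handler,
                           Map<String, ?> sharedState, Map<String, ?> options) {
        this.handler = handler;
    }

    public boolean login() throws LoginException {
        NameCallback name = new NameCallback("user: ");
        PasswordCallback pass = new PasswordCallback("password: ", false);
        try {
            handler.handle(new Callback[] { name, pass });
        } catch (Exception e) {
            throw new LoginException("callback failed: " + e);
        }
        // Replace this with a lookup against your own user database/directory.
        succeeded = "demo".equals(name.getName())
                 && "demo-password".equals(new String(pass.getPassword()));
        if (!succeeded) throw new LoginException("bad credentials");
        return true;
    }

    public boolean commit() { return succeeded; }   // attach Principals here if needed
    public boolean abort()  { succeeded = false; return true; }
    public boolean logout() { succeeded = false; return true; }
}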

You can mount WebDAV on Linux using FUSE and wdfs, or script sessions with
cadaver on Solaris/Unix without mounting WebDAV at all. It works pretty
sweetly on Windows and Mac OS X, too.

Recent versions of Jetty have built-in traffic shaping and QoS features,
although you might get more mileage from HAProxy or a hardware load
balancer.
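The Jetty side is just its QoSFilter servlet filter; wiring it up looks
something like the sketch below. Class names and init parameters can vary
between Jetty releases, so treat it as a rough guide, not gospel, and the
maxRequests/suspendMs numbers are invented.

import java.util.EnumSet;
import javax.servlet.DispatcherType;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.FilterHolder;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlets.QoSFilter;

public class WebDavServer {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);
        ServletContextHandler ctx = new ServletContextHandler(ServletContextHandler.NO_SESSIONS);
        ctx.setContextPath("/");

        // Throttle: at most 50 requests in flight; the rest are suspended and queued.
        FilterHolder qos = new FilterHolder(QoSFilter.class);
        qos.setInitParameter("maxRequests", "50");
        qos.setInitParameter("suspendMs", "30000");
        ctx.addFilter(qos, "/*", EnumSet.of(DispatcherType.REQUEST));

        // The Jackrabbit-based WebDAV servlet would be registered here as well.

        server.setHandler(ctx);
        server.start();
        server.join();
    }
}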

It also enforces HDFS permissions (if you have them enabled), which works
nicely. To get Hadoop permissions enforced on MapReduce jobs as well, check
out Oozie - it's a job submission proxy that runs under Tomcat (might work
with Jetty too - haven't tried) and can use a custom ServletFilter for
authentication, which you can also hook up to your own user
database/directory.
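The filter itself is nothing exotic - it's just standard Servlet API. The
general shape is below; the Basic-auth credential check is a placeholder
for your own user store, and this isn't Oozie's actual wiring.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class UserDbAuthFilter implements Filter {

    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        HttpServletResponse out = (HttpServletResponse) res;

        String auth = http.getHeader("Authorization");
        if (auth != null && isValid(auth)) {
            chain.doFilter(req, res);   // credentials OK, let the request through
        } else {
            out.setHeader("WWW-Authenticate", "Basic realm=\"hadoop\"");
            out.sendError(HttpServletResponse.SC_UNAUTHORIZED);
        }
    }

    // Placeholder: decode the Basic header and check it against your own
    // user database/directory instead of this hard-coded pair.
    private boolean isValid(String authHeader) {
        String decoded = new String(
            java.util.Base64.getDecoder().decode(authHeader.replaceFirst("Basic ", "")));
        return "demo:demo-password".equals(decoded);
    }

    public void destroy() {}
}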

Then you just need to seal the perimeter of your cluster with firewall
rules and you're good to go.

No more Kerberos!
R

-----Original Message-----
From: Eric Sammer [mailto:esammer@cloudera.com]
Sent: Wed 11/3/2010 5:05 PM
To: general@hadoop.apache.org
Subject: Re: web-based file transfer
 
Something like it, but Chukwa is more similar to Flume. For *files*
one may want something slightly different. For a stream of (data)
events, Chukwa, Flume, or Scribe are appropriate.

On Wed, Nov 3, 2010 at 1:22 AM, Ian Holsman <hadoop@holsman.net> wrote:
> Doesn't chukwa do something like this?
>
> ---
> Ian Holsman - 703 879-3128
>
> I saw the angel in the marble and carved until I set him free -- Michelangelo
>
> On 03/11/2010, at 5:44 AM, Eric Sammer <esammer@cloudera.com> wrote:
>
>> I would recommend against clients pushing data directly to hdfs like
>> this for a few reasons.
>>
>> 1. The HDFS cluster would need to be directly exposed to a public
>> network; you don't want to do this.
>> 2. You'd be applying (presumably) a high concurrent load to HDFS which
>> isn't its strong point.
>>
>> From an architecture point of view, it's much nicer to have a queuing
>> system between the upload and ingestion into HDFS that you can
>> throttle and control, if necessary. This also allows you to isolate
>> the cluster from the outside world. So as not to bottleneck on a single
>> writer, you can have uploaded files land in a queue and have multiple
>> competing consumers popping files (or file names upon which to
>> operate) out of the queue and handling the writing in parallel while
>> being able to control the number of workers. If the initial upload is
>> to a shared device like NFS, you can have writers live on multiple
>> boxes and distribute the work.
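A rough sketch of the competing-consumers pattern Eric describes, using the
Hadoop FileSystem API - the queue, pool size, and target paths here are
illustrative only, not anything from his mail:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestWorkers {
    public static void main(String[] args) throws Exception {
        // Queue of uploaded files waiting to be written into HDFS. In practice the
        // upload handler (or a scanner over an NFS drop directory) would feed it.
        final BlockingQueue<Path> uploads = new LinkedBlockingQueue<Path>();
        uploads.put(new Path("/shared/uploads/file-0001.bin"));   // example item

        Configuration conf = new Configuration();       // picks up core-site.xml etc.
        final FileSystem fs = FileSystem.get(conf);

        int workers = 4;                                // throttle by sizing the pool
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            // Competing consumers: whichever worker is free takes the next file
                            Path local = uploads.take();
                            fs.copyFromLocalFile(local, new Path("/ingest/" + local.getName()));
                        }
                    } catch (Exception e) {
                        e.printStackTrace();            // real code would retry or dead-letter
                    }
                }
            });
        }
    }
}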
>>
>> Another option is to consider Flume, but only if you can deal with the
>> fact that it effectively throws away the notion of files and treats
>> their contents as individual events, etc.
>> http://github.com/cloudera/flume.
>>
>> Hope that helps.
>>
>> On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
>> <mlaffoon@semanticresearch.com> wrote:
>>> We want to enable our web-based client (i.e. browser client, Java applet,
>>> whatever) to transfer files into a system backed by HDFS. The obvious
>>> simple solution is to do HTTP file uploads, then copy the file to HDFS. I
>>> was wondering if there is a way to do it with an HDFS-enabled applet where
>>> the server gives the client the necessary Hadoop configuration
>>> information, and the client applet pushes the data directly into HDFS.
>>>
>>>
>>>
>>> Has anybody done this or something similar? Can you give me a starting
>>> point (I'm about to go wander through the hadoop CLI code to get ideas).
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>>
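For completeness, the direct push Mark describes boils down to handing the
client a Configuration pointing at the namenode and writing through the
FileSystem API - roughly the sketch below, with a made-up host name and
paths. Per Eric's reply above, you probably don't want to expose this path
to the public network.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DirectHdfsPush {
    public static void main(String[] args) throws Exception {
        // Configuration the server would hand to the client (host name is invented).
        // "fs.default.name" is the 0.20-era key; newer releases use "fs.defaultFS".
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        try (InputStream in = Files.newInputStream(Paths.get("upload.bin"));
             FSDataOutputStream out = fs.create(new Path("/incoming/upload.bin"))) {
            IOUtils.copyBytes(in, out, 4096, false);    // stream straight into HDFS
        }
        fs.close();
    }
}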
>>
>>
>>
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com


