hadoop-general mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: web-based file transfer
Date Tue, 02 Nov 2010 18:44:01 GMT
I would recommend against clients pushing data directly to HDFS like
this, for a few reasons.

1. The HDFS cluster would need to be directly exposed to a public
network; you don't want to do this.
2. You'd be applying (presumably) a high concurrent load to HDFS, which
isn't its strong point.

From an architecture point of view, it's much nicer to have a queuing
system between the upload and ingestion into HDFS that you can
throttle and control, if necessary. This also allows you to isolate
the cluster from the outside world. So as not to bottleneck on a single
writer, you can have uploaded files land in a queue and have multiple
competing consumers popping files (or file names upon which to
operate) out of the queue and handling the writing in parallel while
being able to control the number of workers. If the initial upload is
to a shared device like NFS, you can have writers live on multiple
boxes and distribute the work.
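
To make the competing-consumers idea concrete, here is a rough sketch in
Python. It is only an illustration under assumptions: the names ingest,
write_file and num_workers are made up for this example, the queue is an
in-process one rather than a real queuing system, and write_file stands
in for whatever actually copies one uploaded file into HDFS.

```python
import queue
import threading


def ingest(paths, write_file, num_workers=4):
    """Drain a batch of uploaded file paths with a fixed pool of workers.

    write_file is a hypothetical callback that copies one file into HDFS.
    num_workers is the throttle: it caps concurrent writers on the cluster.
    Returns (done, failed), where failed holds (path, exception) pairs.
    """
    work = queue.Queue()
    for path in paths:
        work.put(path)

    done, failed = [], []
    lock = threading.Lock()

    def worker():
        # Each worker competes for the next path until the queue is empty.
        while True:
            try:
                path = work.get_nowait()
            except queue.Empty:
                return
            try:
                write_file(path)
                with lock:
                    done.append(path)
            except Exception as exc:
                # Record the failure so the file can be retried later.
                with lock:
                    failed.append((path, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done, failed
```

In practice write_file might shell out to "hadoop fs -put" or use a
client library, and a long-running service would block on the queue
instead of exiting when it is empty. The point is only that the number
of workers, not the number of uploading clients, decides how hard HDFS
gets hit.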

Another option is to consider Flume, but only if you can deal with the
fact that it effectively throws away the notion of files and treats
their contents as individual events, etc.

Hope that helps.

On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
<mlaffoon@semanticresearch.com> wrote:
> We want to enable our web-based client (i.e. browser client, java applet,
> whatever?) to transfer files into a system backed by hdfs. The obvious
> simple solution is to do http file uploads, then copy the file to hdfs. I
> was wondering if there is a way to do it with an hdfs-enabled applet where
> the server gives the client the necessary hadoop configuration
> information, and the client applet pushes the data directly into hdfs.
> Has anybody done this or something similar? Can you give me a starting
> point (I'm about to go wander through the hadoop CLI code to get ideas).
> Thanks,
> Mark

Eric Sammer
twitter: esammer
data: www.cloudera.com
