From: Eric Sammer
To: general@hadoop.apache.org
Date: Wed, 3 Nov 2010 12:05:19 -0400
Subject: Re: web-based file transfer
References: <033e01cb7abb$623b7d50$26b277f0$@com>

Something like it, but Chukwa is more similar to Flume. For *files* one may
want something slightly different.
For a stream of (data) events, Chukwa, Flume, or Scribe are appropriate.

On Wed, Nov 3, 2010 at 1:22 AM, Ian Holsman wrote:
> Doesn't chukwa do something like this?
>
> ---
> Ian Holsman - 703 879-3128
>
> I saw the angel in the marble and carved until I set him free -- Michelangelo
>
> On 03/11/2010, at 5:44 AM, Eric Sammer wrote:
>
>> I would recommend against clients pushing data directly to hdfs like
>> this for a few reasons.
>>
>> 1. The HDFS cluster would need to be directly exposed to a public
>> network; you don't want to do this.
>> 2. You'd be applying (presumably) a high concurrent load to HDFS which
>> isn't its strong point.
>>
>> From an architecture point of view, it's much nicer to have a queuing
>> system between the upload and ingestion into HDFS that you can
>> throttle and control, if necessary. This also allows you to isolate
>> the cluster from the outside world. As to not bottleneck on a single
>> writer, you can have uploaded files land in a queue and have multiple
>> competing consumers popping files (or file names upon which to
>> operate) out of the queue and handling the writing in parallel while
>> being able to control the number of workers. If the initial upload is
>> to a shared device like NFS, you can have writers live on multiple
>> boxes and distribute the work.
>>
>> Another option is to consider Flume, but only if you can deal with the
>> fact that it effectively throws away the notion of files and treats
>> their contents as individual events, etc.
>> http://github.com/cloudera/flume.
>>
>> Hope that helps.
>>
>> On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
>> wrote:
>>> We want to enable our web-based client (i.e. browser client, java applet,
>>> whatever?) to transfer files into a system backed by hdfs. The obvious
>>> simple solution is to do http file uploads, then copy the file to hdfs. I
>>> was wondering if there is a way to do it with an hdfs-enabled applet where
>>> the server gives the client the necessary hadoop configuration
>>> information, and the client applet pushes the data directly into hdfs.
>>>
>>> Has anybody done this or something similar? Can you give me a starting
>>> point (I'm about to go wander through the hadoop CLI code to get ideas).
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com
>

--
Eric Sammer
twitter: esammer
data: www.cloudera.com
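
For illustration only, the "queue plus competing consumers" arrangement
described in the quoted message might look roughly like the sketch below.
This is not code from the thread. The names `ingest`, `put_with_cli`, and
the `/incoming` destination directory are made up for the example, and it
assumes the `hadoop` command-line client is installed on the worker box. A
real deployment would more likely use a long-lived, shared queue than an
in-process one.

```python
import queue
import subprocess
import threading


def put_with_cli(local_path, hdfs_dir="/incoming"):
    """Copy one uploaded file into HDFS by shelling out to `hadoop fs -put`.

    Assumption: the `hadoop` CLI is on PATH; `/incoming` is a hypothetical
    destination directory, not something named in the thread.
    """
    subprocess.run(["hadoop", "fs", "-put", local_path, hdfs_dir], check=True)


def ingest(paths, write_file, num_workers=4):
    """Drain `paths` using `num_workers` competing consumers.

    `write_file` is any callable that writes a single file to HDFS.
    Returns a list of (path, exception) pairs for files that failed,
    so the caller can decide whether to retry them.
    """
    work = queue.Queue()
    for path in paths:
        work.put(path)

    failures = []
    failures_lock = threading.Lock()

    def worker():
        # Each worker competes for the next file name until the queue is empty.
        while True:
            try:
                path = work.get_nowait()
            except queue.Empty:
                return
            try:
                write_file(path)
            except Exception as exc:
                with failures_lock:
                    failures.append((path, exc))

    # The number of threads is the throttle: it caps concurrent HDFS writers.
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

A caller would pass the list of uploaded file paths, for example
`ingest(uploaded_paths, put_with_cli, num_workers=4)`, and raise or lower
`num_workers` to control how much concurrent load reaches the cluster.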