From: Steve Loughran <stevel@apache.org>
Date: Wed, 03 Nov 2010 11:35:21 +0000
To: general@hadoop.apache.org
Subject: Re: web-based file transfer
Message-ID: <4CD148F9.8080600@apache.org>
In-Reply-To: <033e01cb7abb$623b7d50$26b277f0$@com>

On 02/11/10 18:25, Mark Laffoon wrote:
> We want to enable our web-based client (i.e. browser client, java applet,
> whatever?) to transfer files into a system backed by hdfs. The obvious
> simple solution is to do http file uploads, then copy the file to hdfs. I
> was wondering if there is a way to do it with an hdfs-enabled applet where
> the server gives the client the necessary hadoop configuration
> information, and the client applet pushes the data directly into hdfs.

I recall some work done with webdav
https://issues.apache.org/jira/browse/HDFS-225
-but I don't think it's progressed.

I've done things like this in the past with servlets and forms; the webapp
you deploy has the hadoop configuration (and the network rights to talk to
HDFS in the datacentre), and the clients PUT/POST up content:
http://www.slideshare.net/steve_l/long-haul-hadoop

However, you are limited to 2GB of upload/download in most web clients;
some (Chrome) go up to 4GB, but you are pushing the limit there.
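The servlet-and-forms pattern can be sketched with the JDK's built-in
com.sun.net.httpserver so the example stays self-contained (no servlet or
Hadoop jars needed to compile it). The class name UploadGateway and the
/upload/ path are made up for illustration; in the real deployment the
output stream would come from Hadoop's FileSystem.get(conf).create(path)
rather than a local FileOutputStream, which is the only part that touches
HDFS.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.*;
import java.net.InetSocketAddress;

// Sketch of the server half of the pattern: the webapp owns the Hadoop
// configuration and network access; browsers just PUT/POST bytes at it.
// No filename validation here -- this is a sketch, not production code.
public class UploadGateway {

    public static HttpServer start(int port, final File destDir) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/upload/", exchange -> {
            // Target name comes from the request path, e.g. /upload/data.bin
            String name = exchange.getRequestURI().getPath()
                                  .substring("/upload/".length());
            File dest = new File(destDir, name);
            // Stream the request body straight to the destination; with the
            // Hadoop API this would be FileSystem.get(conf).create(path).
            try (InputStream in = exchange.getRequestBody();
                 OutputStream out = new FileOutputStream(dest)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
            exchange.sendResponseHeaders(200, -1); // 200 OK, no response body
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

Because the copy is a buffered stream loop, the servlet side never needs the
whole file in memory; the 2GB trouble is in the Content-Length bookkeeping,
not the copy itself.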
Even the Java servlet APIs all assume that the Content-Length header fits
into a signed 32-bit integer, and they get unhappy once you go over 2GB
(something I worry about in
http://jira.smartfrog.org/jira/browse/SFOS-1476 ).

Because Hadoop really likes large files -tens to hundreds of GB in a big
cluster- I don't think the current web infrastructure is up to working
with it.

That said, looking at Hudson for the nightly runs of my bulk IO tests,
Jetty will serve up 4GB in 5 minutes (loopback i/f), and I can POST or PUT
up 4GB, but I have to get/set the content-length headers myself rather
than rely on the java.net client and servlet implementations to handle it:
http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/components/www/src/org/smartfrog/services/www/bulkio/client/SunJavaBulkIOClient.java?revision=8430&view=markup

If you can control the client, then maybe you would be able to do >4GB
uploads, but otherwise you are stuck with data <2GB in size, which is
-what- 4-8 blocks in a production cluster?

-steve
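The 2GB ceiling is just signed-32-bit arithmetic: ServletRequest.getContentLength()
returns an int, and at the time of this thread HttpURLConnection only offered
setFixedLengthStreamingMode(int) (a long overload arrived later, in Java 7),
so a Content-Length past Integer.MAX_VALUE simply cannot be represented. A
tiny demonstration, with a hypothetical 4GiB upload size:

```java
// Shows why a 32-bit Content-Length field tops out below a 4GiB upload.
public class ContentLengthLimit {
    public static void main(String[] args) {
        long fourGiB = 4L * 1024 * 1024 * 1024;   // a plausible bulk upload

        // 2^32 bytes does not fit in a signed 32-bit int (max 2^31 - 1):
        System.out.println(fourGiB > Integer.MAX_VALUE);  // true

        // Truncating to int, as a 32-bit API effectively does, corrupts it
        // (the low 32 bits of 2^32 are all zero):
        System.out.println((int) fourGiB);                // 0
    }
}
```

This is why the bulk IO client above sets the header itself instead of
trusting the library plumbing to carry the length end to end.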