From: cdouglas@apache.org
To: mapreduce-commits@hadoop.apache.org
Reply-To: mapreduce-dev@hadoop.apache.org
Date: Tue, 03 Nov 2009 08:02:12 -0000
Subject: svn commit: r832329 - in /hadoop/mapreduce/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/distcp.xml
Message-Id: <20091103080212.4E43723888BB@eris.apache.org>

Author: cdouglas
Date: Tue Nov 3 08:02:11 2009
New Revision: 832329

URL:
http://svn.apache.org/viewvc?rev=832329&view=rev
Log:
MAPREDUCE-971. Document use of distcp when copying to s3, managing timeouts in particular. Contributed by Aaron Kimball

Modified:
    hadoop/mapreduce/trunk/CHANGES.txt
    hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml

Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=832329&r1=832328&r2=832329&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Tue Nov 3 08:02:11 2009
@@ -486,6 +486,9 @@
 MAPREDUCE-1012. Mark Context interfaces as public evolving. (Tom White via cdouglas)

+ MAPREDUCE-971. Document use of distcp when copying to s3, managing timeouts
+ in particular. (Aaron Kimball via cdouglas)
+
 BUG FIXES

 MAPREDUCE-878. Rename fair scheduler design doc to

Modified: hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml?rev=832329&r1=832328&r2=832329&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml (original)
+++ hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml Tue Nov 3 08:02:11 2009
@@ -317,6 +317,36 @@
+ Copying to S3 + +

+ DistCp can be used to copy data between HDFS and other filesystems,
+ including those backed by S3. The s3n FileSystem
+ implementation allows DistCp (and Hadoop in general) to use an S3
+ bucket as a source or target for transfers. To transfer data from
+ HDFS to an S3 bucket, invoke DistCp using arguments like the following:
+

+
+bash$ hadoop distcp hdfs://nn:8020/foo/bar \
+ s3n://$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY@<bucket>/foo/bar
+
+

+ $AWS_ACCESS_KEY_ID and
+ $AWS_SECRET_ACCESS_KEY are environment variables holding
+ S3 access credentials.
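As an illustration of the credential variables described above, the following shell sketch assembles the s3n target URI and returns an error when either variable is unset. The function name s3n_uri, the bucket name, and the example key values are hypothetical; none of them are part of DistCp or of the committed documentation.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Hadoop): build an s3n:// URI from the
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
# Usage: s3n_uri <bucket> <path>
s3n_uri() {
  local bucket=$1 path=$2
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set" >&2
    return 1
  fi
  # Strip one leading slash from the path so the URI has a single separator.
  printf 's3n://%s:%s@%s/%s\n' \
    "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" "$bucket" "${path#/}"
}

# Example call with placeholder credentials (assumed values, for illustration only).
AWS_ACCESS_KEY_ID=AKIDEXAMPLE AWS_SECRET_ACCESS_KEY=secretexample \
  s3n_uri mybucket /foo/bar
```

The resulting string can be passed as the target argument of hadoop distcp in place of the hand-written URI.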

+ +

+ Some FileSystem operations take longer on S3 than on HDFS. If you
+ are transferring large files to S3 (e.g., 1 GB and up), you may
+ experience timeouts during your job. To prevent this, you should set
+ the task timeout to a larger interval than is typically used:
+

+
+bash$ hadoop distcp -D mapred.task.timeout=1800000 \
+ hdfs://nn:8020/foo/bar \
+ s3n://$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY@<bucket>/foo/bar
+
+ +
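The value of mapred.task.timeout is given in milliseconds, so the 1800000 used above corresponds to 30 minutes. A minimal sketch, assuming the hypothetical helper names minutes_to_ms and build_distcp_cmd, that derives the value from a number of minutes and prints the resulting command instead of running it:

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert whole minutes to milliseconds.
minutes_to_ms() {
  echo $(( $1 * 60 * 1000 ))
}

# Hypothetical helper: print (not execute) a distcp command line with the
# task timeout set. Usage: build_distcp_cmd <minutes> <source> <target>
build_distcp_cmd() {
  local timeout_min=$1 src=$2 dst=$3
  printf 'hadoop distcp -D mapred.task.timeout=%s %s %s\n' \
    "$(minutes_to_ms "$timeout_min")" "$src" "$dst"
}

# 30 minutes yields the same value as the documented example.
build_distcp_cmd 30 hdfs://nn:8020/foo/bar 's3n://<bucket>/foo/bar'
```

Printing the command first makes it easy to check the timeout value before submitting a long-running copy.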
MapReduce and Other Side-effects

As has been mentioned in the preceding, should a map fail to copy