hadoop-mapreduce-commits mailing list archives

From cdoug...@apache.org
Subject svn commit: r832329 - in /hadoop/mapreduce/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/distcp.xml
Date Tue, 03 Nov 2009 08:02:12 GMT
Author: cdouglas
Date: Tue Nov  3 08:02:11 2009
New Revision: 832329

URL: http://svn.apache.org/viewvc?rev=832329&view=rev
Log:
MAPREDUCE-971. Document use of distcp when copying to s3, managing timeouts
in particular. Contributed by Aaron Kimball

Modified:
    hadoop/mapreduce/trunk/CHANGES.txt
    hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml

Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=832329&r1=832328&r2=832329&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Tue Nov  3 08:02:11 2009
@@ -486,6 +486,9 @@
     MAPREDUCE-1012. Mark Context interfaces as public evolving. (Tom White via
     cdouglas)
 
+    MAPREDUCE-971. Document use of distcp when copying to s3, managing timeouts
+    in particular. (Aaron Kimball via cdouglas)
+
   BUG FIXES
 
     MAPREDUCE-878. Rename fair scheduler design doc to 

Modified: hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml?rev=832329&r1=832328&r2=832329&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml (original)
+++ hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/distcp.xml Tue Nov  3 08:02:11 2009
@@ -317,6 +317,36 @@
       </section>
 
       <section>
+        <title>Copying to S3</title>
+
+        <p>DistCp can be used to copy data between HDFS and other filesystems,
+        including those backed by S3. The <code>s3n</code> FileSystem
+        implementation allows DistCp (and Hadoop in general) to use an S3
+        bucket as a source or target for transfers. To transfer data from
+        HDFS to an S3 bucket, invoke DistCp using arguments like the following:
+        </p>
+<source>
+bash$ hadoop distcp hdfs://nn:8020/foo/bar \
+    s3n://$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY@&lt;bucket&gt;/foo/bar
+</source>
+
+        <p><code>$AWS_ACCESS_KEY_ID</code> and
+        <code>$AWS_SECRET_ACCESS_KEY</code> are environment variables holding
+        S3 access credentials.</p>
+
+        <p>Some FileSystem operations take longer on S3 than on HDFS. If you
+        are transferring large files to S3 (e.g., 1 GB and up), you may
+        experience timeouts during your job. To prevent this, you should set
+        the task timeout to a larger interval than is typically used:
+        </p>
+<source>
+bash$ hadoop distcp -D mapred.task.timeout=1800000 \
+    hdfs://nn:8020/foo/bar \
+    s3n://$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY@&lt;bucket&gt;/foo/bar
+</source>
+      </section>
+
+      <section>
         <title>MapReduce and Other Side-effects</title>
 
         <p>As has been mentioned in the preceding, should a map fail to copy
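
Note on the timeout value in the patch above: mapred.task.timeout is given in milliseconds, so 1800000 is 30 minutes (the stock default is 600000, i.e. 10 minutes). The following is only a sketch of how that value could be derived from a figure in minutes. The variable names are illustrative, the 30-minute figure is an assumption taken from the example in the patch, and the command is printed rather than run. Credentials are left out of the printed URI; the patch shows the form that embeds them.

```shell
#!/bin/sh
# Sketch: derive mapred.task.timeout (milliseconds) from a timeout in minutes.
# TIMEOUT_MINUTES is an assumed value; tune it to the size of the files copied.
TIMEOUT_MINUTES=30
TIMEOUT_MS=$((TIMEOUT_MINUTES * 60 * 1000))

# Build the -D option exactly as it appears in the patch's example.
DISTCP_OPTS="-D mapred.task.timeout=${TIMEOUT_MS}"

# Dry run: print the command instead of executing it.
# <bucket> is the same placeholder used in the patch; credentials are omitted.
echo "hadoop distcp ${DISTCP_OPTS} hdfs://nn:8020/foo/bar s3n://<bucket>/foo/bar"
```

With TIMEOUT_MINUTES=30 the printed option matches the one in the patch, -D mapred.task.timeout=1800000.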


