From: Apache Wiki
To: hadoop-commits@lucene.apache.org
Date: Thu, 08 Feb 2007 21:29:27 -0000
Subject: [Lucene-hadoop Wiki] Update of "AmazonS3" by MichaelStack

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by MichaelStack:
http://wiki.apache.org/lucene-hadoop/AmazonS3

The comment on the change is:
Added 'running bulk copies in and out of S3' using distcp

------------------------------------------------------------------------------
  S3 support was introduced in Hadoop 0.10.0 ([http://issues.apache.org/jira/browse/HADOOP-574 HADOOP-574]), but this had a few bugs so you should use Hadoop 0.10.1 or later.
- The patch in [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] makes S3 work with the
- Hadoop CopyFile tool.
  
  = Setting up hadoop to use S3 as a replacement for HDFS =
@@ -68, +66 @@
   bin/start-mapred.sh
   }}}
  
- = Setting up hadoop to use S3 as a repository for data input to and output from Map/Reduce =
+ = Running bulk copies in and out of S3 =
  
- The idea here is to put your input on S3, then transfer it to HDFS using
- the `bin/hadoop distcp` tool. Then once the Map/Reduce job is complete the output is copied to S3
- as input to a further job, or retrieved as a final result.
- [More instructions will be added after [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is complete.]
+ Support for the S3 filesystem was added to the `${HADOOP_HOME}/bin/hadoop distcp` tool in Hadoop 0.11.0 (see [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862]). The `distcp` tool sets up a MapReduce job to run the copy. Using `distcp`, a cluster with many nodes can copy large amounts of data quickly. The number of map tasks is calculated by counting the files in the source: each map task is responsible for copying one file. Source and target may refer to disparate filesystem types.
+ For example, source might refer to the local filesystem or `hdfs` with `S3` as the target.
+ The `distcp` tool is useful for quickly prepping S3 for MapReduce jobs that use S3 for input, or for backing up the content of `hdfs`.
+ 
+ Here is an example copying a nutch segment named `0070206153839-1998` at `/user/nutch` in `hdfs` to an S3 bucket named 'nutch' (let the S3 AWS_ACCESS_KEY_ID be `123` and the S3 AWS_ACCESS_KEY_SECRET be `456`):
+ 
+ {{{
+ % ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/
+ }}}
+ 
+ Flip the arguments if you want to run the copy in the opposite direction (a sketch of the reverse copy appears at the end of this section).
+ 
+ Other schemes supported by `distcp` are `file` (for the local filesystem) and `http`.
+ 
+ You'll likely encounter the following errors if you are running a stock Hadoop 0.11.X:
+ 
+ {{{
+ org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed...We encountered an internal error. Please try again...
+ 
+ put: Input stream is not repeatable as 1048576 bytes have been written, exceeding the available buffer size of 131072
+ }}}
+ 
+ See [https://issues.apache.org/jira/browse/HADOOP-882 HADOOP-882] for discussion of the above issues and workarounds/fixes.
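+ 
+ As a sketch of the reverse direction mentioned above (re-using the same example namenode host, bucket name, and placeholder access key and secret, none of which are real), flipping the arguments copies the segment from S3 back into `hdfs`:
+ 
+ {{{
+ % ${HADOOP_HOME}/bin/hadoop distcp s3://123:456@nutch/0070206153839-1998 hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/
+ }}}
+ 
+ As in the forward copy, the access key id and secret are embedded in the `s3` URI of this example.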