Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by MichaelStack:
http://wiki.apache.org/lucene-hadoop/AmazonS3

The comment on the change is:
Added 'running bulk copies in and out of S3' using distcp

------------------------------------------------------------------------------
  S3 support was introduced in Hadoop 0.10.0 ([http://issues.apache.org/jira/browse/HADOOP-574 HADOOP-574]),
  but this had a few bugs so you should use Hadoop 0.10.1 or later.
- The patch in [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] makes S3 work with the
- Hadoop CopyFile tool.

  = Setting up hadoop to use S3 as a replacement for HDFS =

@@ -68, +66 @@

  bin/start-mapred.sh
  }}}

- = Setting up hadoop to use S3 as a repository for data input to and output from Map/Reduce =
+ = Running bulk copies in and out of S3 =

- The idea here is to put your input on S3, then transfer it to HDFS using
- the `bin/hadoop distcp` tool. Then once the Map/Reduce job is complete the output is copied to S3
- as input to a further job, or retrieved as a final result.

- [More instructions will be added after [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is complete.]
+ Support for the S3 filesystem was added to the `${HADOOP_HOME}/bin/hadoop distcp` tool in Hadoop 0.11.0 (see [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862]). The `distcp` tool sets up a MapReduce job to run the copy. Using `distcp`, a cluster of many members can copy lots of data quickly. The number of map tasks is calculated by counting the number of files in the source: that is, each map task is responsible for copying one file. Source and target may refer to disparate filesystem types. For example, source might refer to the local filesystem or `hdfs` with `S3` as the target.
+ The `distcp` tool is useful for quickly prepping S3 for MapReduce jobs that use S3 for input or for backing up the content of `hdfs`.
+
+ Here is an example copying a nutch segment named `0070206153839-1998` at `/user/nutch` in `hdfs` to an S3 bucket named 'nutch' (Let the S3 AWS_ACCESS_KEY_ID be `123` and the S3 AWS_ACCESS_KEY_SECRET be `456`):
+
+
+ {{{
+ % ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/
+ }}}
+
+ Flip the arguments if you want to run the copy in the opposite direction.
+
+ Other schemes supported by `distcp` are `file` (for local), and `http`.
+
+ You'll likely encounter the following errors if you are running a stock Hadoop 0.11.X.
+
+ {{{
+ org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed...We encountered an internal error. Please try again...
+
+ put: Input stream is not repeatable as 1048576 bytes have been written, exceeding the available buffer size of 131072
+ }}}
+
+ See [https://issues.apache.org/jira/browse/HADOOP-882 HADOOP-882] for discussion of the above issues and workarounds/fixes.
+
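
As a rough illustration of "flip the arguments", the sketch below builds the `distcp` command line for both directions and only prints it; nothing is copied. The helper name `distcp_cmd`, the variable names, and the `/opt/hadoop` fallback for `HADOOP_HOME` are assumptions made for this example and are not part of the wiki page. The host, path, bucket, and credentials are the placeholder values from the example above.

```shell
#!/bin/sh
# Sketch only: prints distcp command lines, does not run them.
# /opt/hadoop is an assumed fallback, not a documented default.
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"

# Placeholder values taken from the example in the page text.
HDFS_URI="hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998"
S3_URI="s3://123:456@nutch/"

# distcp_cmd SOURCE TARGET (hypothetical helper):
# print the command that would copy SOURCE to TARGET.
distcp_cmd() {
    printf '%s/bin/hadoop distcp %s %s\n' "$HADOOP_HOME" "$1" "$2"
}

# hdfs -> S3, as in the example.
TO_S3=$(distcp_cmd "$HDFS_URI" "$S3_URI")

# S3 -> hdfs: same command with source and target swapped.
FROM_S3=$(distcp_cmd "$S3_URI" "$HDFS_URI")

echo "$TO_S3"
echo "$FROM_S3"
```

Printing the commands first makes it easy to check the source and target order, and that the credentials in the `s3://` URI are the intended ones, before launching a MapReduce copy job.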