From: Apache Wiki
To: hadoop-commits@lucene.apache.org
Date: Thu, 08 Feb 2007 21:29:27 -0000
Subject: [Lucene-hadoop Wiki] Update of "AmazonS3" by MichaelStack

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by MichaelStack:
http://wiki.apache.org/lucene-hadoop/AmazonS3

The comment on the change is:
Added 'running bulk copies in and out of S3' using distcp

------------------------------------------------------------------------------
  S3 support was introduced in Hadoop 0.10.0 ([http://issues.apache.org/jira/browse/HADOOP-574 HADOOP-574]), but this had a few bugs so you should use Hadoop 0.10.1 or later.
- The patch in [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] makes S3 work with the
- Hadoop CopyFile tool.
  
  = Setting up hadoop to use S3 as a replacement for HDFS =
@@ -68, +66 @@
   bin/start-mapred.sh
   }}}
  
- = Setting up hadoop to use S3 as a repository for data input to and output from Map/Reduce =
+ = Running bulk copies in and out of S3 =
  
- The idea here is to put your input on S3, then transfer it to HDFS using
- the `bin/hadoop distcp` tool. Then once the Map/Reduce job is complete the output is copied to S3
- as input to a further job, or retrieved as a final result.
- [More instructions will be added after [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is complete.]
+ Support for the S3 filesystem was added to the `${HADOOP_HOME}/bin/hadoop distcp` tool in Hadoop 0.11.0 (see [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862]). The `distcp` tool sets up a MapReduce job to run the copy. Using `distcp`, a cluster with many nodes can copy large amounts of data quickly. The number of map tasks is calculated by counting the files in the source: each map task is responsible for copying one file. Source and target may refer to disparate filesystem types.
+ For example, source might refer to the local filesystem or `hdfs` with `S3` as the target.
+ The `distcp` tool is useful for quickly prepping S3 for MapReduce jobs that use S3 for input, or for backing up the content of `hdfs`.
+ 
+ Here is an example copying a nutch segment named `0070206153839-1998` at `/user/nutch` in `hdfs` to an S3 bucket named 'nutch' (let the S3 AWS_ACCESS_KEY_ID be `123` and the S3 AWS_ACCESS_KEY_SECRET be `456`):
+ 
+ {{{
+ % ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/
+ }}}
+ 
+ Flip the arguments if you want to run the copy in the opposite direction (a sketch of the reverse copy appears at the end of this section).
+ 
+ Other schemes supported by `distcp` are `file` (for the local filesystem) and `http`.
+ 
+ You'll likely encounter the following errors if you are running a stock Hadoop 0.11.X:
+ 
+ {{{
+ org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed...We encountered an internal error. Please try again...
+ 
+ put: Input stream is not repeatable as 1048576 bytes have been written, exceeding the available buffer size of 131072
+ }}}
+ 
+ See [https://issues.apache.org/jira/browse/HADOOP-882 HADOOP-882] for discussion of the above issues and workarounds/fixes.
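+ 
+ As a sketch of the reverse direction mentioned above (re-using the same example namenode host, bucket name, and placeholder access key and secret, none of which are real), flipping the arguments copies the segment from S3 back into `hdfs`:
+ 
+ {{{
+ % ${HADOOP_HOME}/bin/hadoop distcp s3://123:456@nutch/0070206153839-1998 hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/
+ }}}
+ 
+ As in the forward copy, the access key id and secret are embedded in the `s3` URI of this example.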