hadoop-common-commits mailing list archives

From: Apache Wiki <wikidi...@apache.org>
Subject: [Lucene-hadoop Wiki] Update of "AmazonS3" by MichaelStack
Date: Thu, 08 Feb 2007 21:29:27 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by MichaelStack:
http://wiki.apache.org/lucene-hadoop/AmazonS3

The comment on the change is:
Added 'running bulk copies in and out of S3' using distcp

------------------------------------------------------------------------------
  
  S3 support was introduced in Hadoop 0.10.0 ([http://issues.apache.org/jira/browse/HADOOP-574 HADOOP-574]),
  but this had a few bugs so you should use Hadoop 0.10.1 or later.
- The patch in [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] makes S3 work with the Hadoop CopyFile tool.
  
  = Setting up hadoop to use S3 as a replacement for HDFS =
  
@@ -68, +66 @@

  bin/start-mapred.sh
  }}}
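  
  In brief, pointing Hadoop at S3 in `hadoop-site.xml` amounts to settings along the following lines (a minimal sketch: the bucket name and credentials are placeholders, and the property names assume the S3 filesystem introduced by HADOOP-574):
  
  {{{
  <property>
    <name>fs.default.name</name>
    <value>s3://YOUR-BUCKET</value>
  </property>
  
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR-AWS-ACCESS-KEY-ID</value>
  </property>
  
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR-AWS-SECRET-ACCESS-KEY</value>
  </property>
  }}}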
  
- = Setting up hadoop to use S3 as a repository for data input to and output from Map/Reduce =
  
+ = Running bulk copies in and out of S3 =
- The idea here is to put your input on S3, then transfer it to HDFS using
- the `bin/hadoop distcp` tool. Then once the Map/Reduce job is complete the output is copied to S3
- as input to a further job, or retrieved as a final result.
  
- [More instructions will be added after [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is complete.]
+ Support for the S3 filesystem was added to the `${HADOOP_HOME}/bin/hadoop distcp` tool in Hadoop 0.11.0 (see [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862]).  The `distcp` tool sets up a MapReduce job to run the copy, so a cluster with many nodes can copy large amounts of data quickly.  The number of map tasks is calculated by counting the number of files in the source: i.e. each map task is responsible for copying one file.  Source and target may refer to disparate filesystem types.  For example, the source might be the local filesystem or `hdfs`, with `S3` as the target.
  
+ The `distcp` tool is useful for quickly populating S3 with input data for MapReduce jobs that read from S3, or for backing up the content of `hdfs`.
+ 
+ Here is an example copying a nutch segment named `0070206153839-1998` at `/user/nutch` in `hdfs` to an S3 bucket named `nutch` (let the `AWS_ACCESS_KEY_ID` be `123` and the `AWS_SECRET_ACCESS_KEY` be `456`):
+ 
+ {{{
+ % ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/
+ }}}
+ 
+ Flip the arguments if you want to run the copy in the opposite direction.
+ 
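+ For example, to pull the same segment back out of S3 into `hdfs`, swap source and target (a sketch; it assumes the segment landed at the root of the `nutch` bucket, and reuses the placeholder keys and namenode from above):
+ 
+ {{{
+ % ${HADOOP_HOME}/bin/hadoop distcp s3://123:456@nutch/0070206153839-1998 hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/
+ }}}
+ 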
+ Other schemes supported by `distcp` are `file` (for the local filesystem) and `http`.
+ 
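+ For instance, a directory on the local filesystem can be pushed straight into S3 without staging it in `hdfs` first (a sketch; the local path and target directory are made up, and a `file` source only works if every node running a map task can read that path, e.g. on a single-node setup or a shared mount):
+ 
+ {{{
+ % ${HADOOP_HOME}/bin/hadoop distcp file:///tmp/crawl-input s3://123:456@nutch/crawl-input
+ }}}
+ 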
+ You'll likely encounter the following errors if you are running a stock Hadoop 0.11.X.
+ 
+ {{{
+ org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed...We encountered an internal error. Please try again...
+ 
+ put: Input stream is not repeatable as 1048576 bytes have been written, exceeding the available buffer size of 131072
+ }}}
+ 
+ See [https://issues.apache.org/jira/browse/HADOOP-882 HADOOP-882] for discussion of the above issues and workarounds/fixes.
+ 
