hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "AmazonS3" by TomWhite
Date Sun, 07 Jan 2007 22:09:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by TomWhite:

New page:
[http://aws.amazon.com/s3 Amazon S3] (Simple Storage Service) is a data storage service. You are billed
monthly for storage and data transfer. Transfer between S3 and Self:AmazonEC2 is free, which makes
S3 attractive for Hadoop users who run clusters on EC2.

There are two ways that S3 can be used with Hadoop's Map/Reduce: either as a replacement for HDFS
(i.e. using it as a reliable distributed filesystem with support for very large files),
or as a convenient repository for data input to and output from Map/Reduce. In the second case,
HDFS is still used for the Map/Reduce phase.

S3 support was introduced in Hadoop 0.10 ([http://issues.apache.org/jira/browse/HADOOP-574 HADOOP-574]),
but it needs the patch in [http://issues.apache.org/jira/browse/HADOOP-857 HADOOP-857] to work properly.
The patch in [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] makes S3 work with the
Hadoop CopyFile tool.
(Hopefully these patches will be integrated in the next release.)

= Setting up hadoop to use S3 as a replacement for HDFS =

Put the following in ''conf/hadoop-site.xml'' to set the default filesystem to be S3:
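A minimal sketch of such a configuration, written as a shell snippet that generates the file in a scratch location so that nothing under ''conf/'' is overwritten. `BUCKET`, `ID` and `SECRET` are placeholders, and the property names other than `fs.s3.awsAccessKeyId` are assumptions that should be checked against your Hadoop release:

```shell
#!/bin/sh
# Scratch file; copy its contents into conf/hadoop-site.xml after review.
site_file=$(mktemp)

# BUCKET, ID and SECRET are placeholders, not real values.
cat > "$site_file" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Make S3 the default filesystem (assumed property name). -->
  <property>
    <name>fs.default.name</name>
    <value>s3://BUCKET</value>
  </property>
  <!-- AWS credentials used by the S3 filesystem. -->
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>SECRET</value>
  </property>
</configuration>
EOF

echo "Wrote sample configuration to $site_file"
```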




Alternatively, you can put the access key ID and the secret access key into the S3 URI as
the user info, in the form `s3://ID:SECRET@BUCKET`.


Note that since the secret access key can contain slashes, you must remember to escape them by
replacing each slash `/` with the string `%2F`.
Keys specified in the URI take precedence over any specified using the properties `fs.s3.awsAccessKeyId`
and `fs.s3.awsSecretAccessKey`.
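The escaping step can be sketched in shell as follows. The key ID, secret and bucket name are made-up placeholders, and the `s3://ID:SECRET@BUCKET` layout is assumed from the standard URI user-info syntax rather than taken from a particular Hadoop release:

```shell
#!/bin/sh
# Hypothetical credentials and bucket; substitute your own values.
access_key_id='ID'
secret_access_key='abc/def/ghi'
bucket='BUCKET'

# Replace every slash in the secret with %2F so it is safe inside the URI.
escaped_secret=$(printf '%s' "$secret_access_key" | sed 's|/|%2F|g')

# Assemble the URI with the credentials in the user-info part.
s3_uri="s3://${access_key_id}:${escaped_secret}@${bucket}"
echo "$s3_uri"
```

With the placeholder values above, the script prints `s3://ID:abc%2Fdef%2Fghi@BUCKET`.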

Running the Map/Reduce demo in the [http://lucene.apache.org/hadoop/api/index.html Hadoop API Documentation]
using S3 is now a matter of running:

mkdir input
cp conf/*.xml input
bin/hadoop fs -put input input
bin/hadoop fs -ls input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output
cat output/*

To run in distributed mode you only need to run a JobTracker - the HDFS NameNode is unnecessary.

= Setting up hadoop to use S3 as a repository for data input to and output from Map/Reduce =

The idea here is to put your input on S3, then transfer it to HDFS using
the `bin/hadoop distcp` tool. Once the Map/Reduce job is complete, the output is copied
back to S3, where it can serve as input to a further job or be retrieved as a final result.
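The sequence described above might look roughly like the following dry run. The bucket name, NameNode address and paths are hypothetical, `run` is an illustrative helper that only prints each command, and copying to and from S3 with `distcp` is assumed to depend on the HADOOP-862 patch mentioned earlier:

```shell
#!/bin/sh
# Hypothetical locations; replace with your own bucket and NameNode.
s3_base='s3://BUCKET'
hdfs_base='hdfs://namenode:9000/user/hadoop'

# Illustrative helper: print the command instead of executing it, so the
# sequence can be reviewed without a cluster.
run() { echo "$*"; }

# 1. Copy the input from S3 into HDFS.
run bin/hadoop distcp "$s3_base/input" "$hdfs_base/input"

# 2. Run the Map/Reduce job against HDFS (the grep example job).
run bin/hadoop jar hadoop-*-examples.jar grep "$hdfs_base/input" "$hdfs_base/output" 'dfs[a-z.]+'

# 3. Copy the job output from HDFS back to S3.
run bin/hadoop distcp "$hdfs_base/output" "$s3_base/output"
```

To execute the commands rather than print them, the body of `run` can be changed to `"$@"`.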

[More instructions will be added after [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862]
is complete.]
