hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "AmazonS3" by SteveLoughran
Date Wed, 04 May 2016 12:57:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "AmazonS3" page has been changed by SteveLoughran:
https://wiki.apache.org/hadoop/AmazonS3?action=diff&rev1=19&rev2=20

Comment:
update s3a docs, callout AWS, change heading levels

+ = S3 Support in Apache Hadoop =
+ 
  [[http://aws.amazon.com/s3|Amazon S3]] (Simple Storage Service) is a data storage service.
You are billed
- monthly for storage and data transfer. Transfer between S3 and [[AmazonEC2]] is free. This
makes use of
+ monthly for storage and data transfer. Transfer between S3 and [[AmazonEC2]] instances in the same geographical location is free. This makes
  S3 attractive for Hadoop users who run clusters on EC2.
  
  Hadoop provides multiple filesystem clients for reading and writing data in Amazon S3 or a compatible service.
  
-  S3 Native FileSystem (URI scheme: s3n)::
+ === S3 Native FileSystem (URI scheme: s3n) ===
-  A native filesystem for reading and writing regular files on S3. The advantage of this
filesystem is that you can access files on S3 that were written with other tools. Conversely,
other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file
size imposed by S3.
  
+ A native filesystem for reading and writing regular files on S3. The advantage of this filesystem
is that you can access files on S3 that were written with other tools. Conversely, other tools
can access files written using Hadoop. The S3N code is stable and widely used, but is not
adding any new features (which is why it remains stable). S3N requires a suitable version
of the jets3t JAR on the classpath.
-  S3A (URI scheme: s3a)::
-  A successor to the S3 Native, s3n fs, the S3a: system uses Amazon's libraries to interact
with S3. This allows S3a to support larger files (no more 5GB limit), higher performance operations
and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects
accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL
schema.
  
+ === S3A (URI scheme: s3a) ===
+ 
+ A successor to the S3 Native (s3n://) filesystem, the s3a filesystem uses Amazon's own libraries to interact with S3. This allows s3a to support larger files (no more 5GB limit), higher-performance operations and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL scheme.
+ 
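As a sketch of that scheme swap (the bucket name and path here are hypothetical placeholders, not from this page), the same objects can be listed through either client once both are configured:

```shell
# List the same objects through the old and the new client.
# "mybucket" is a placeholder; substitute your own bucket name.
hadoop fs -ls s3n://mybucket/data/
hadoop fs -ls s3a://mybucket/data/
```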
+ S3A has been considered usable in production since Hadoop 2.7, and is undergoing active
maintenance for enhanced security, scalability and performance.
+ 
+ '''important:''' S3A requires the exact version of the AWS SDK (the aws-java-sdk JAR) against which Hadoop was built (and with which it is bundled).
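One way to see which AWS SDK version a given Hadoop binary distribution bundles is to look for the SDK JAR on disk; the directory below assumes a standard binary-distribution layout and may differ between distributions:

```shell
# Locate the AWS SDK jar bundled with this Hadoop distribution.
# The exact path varies between Hadoop versions and vendor packages.
ls "${HADOOP_HOME}/share/hadoop/tools/lib/" | grep -i aws
```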
+ 
-  S3 Block FileSystem (URI scheme: s3)::
+ === S3 Block FileSystem (URI scheme: s3) ===
+ 
+ '''important:''' this section covers the s3:// filesystem support inside Apache Hadoop. The one in Amazon EMR is different; see the details at the bottom of this page.
+ 
-  A block-based filesystem backed by S3. Files are stored as blocks, just like they are in
HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate
a bucket for the filesystem - you should not use an existing bucket containing files, or write
other files to the same bucket. The files stored by this filesystem can be larger than 5GB,
but they are not interoperable with other S3 tools.
+ A block-based filesystem backed by S3. Files are stored as blocks, just like they are in
HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate
a bucket for the filesystem - you should not use an existing bucket containing files, or write
other files to the same bucket. The files stored by this filesystem can be larger than 5GB,
but they are not interoperable with other S3 tools. Nobody should be uploading data to S3 via this scheme any more; it will eventually be removed from Hadoop entirely. Consider it (as of May 2016) deprecated.
+ 
  
  S3 can be used as a convenient repository for data input to, and output from, analytics applications, using any of the S3 filesystem clients.
  Data in S3 outlasts Hadoop clusters on EC2, so S3 is where persistent data should be kept.
  
  Note that by using S3 as an input you lose the data locality optimization, which may be significant. The general best practice is to copy in data using `distcp` at the start of a workflow, then copy it out at the end, using the transient HDFS in between.
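That workflow can be sketched as follows; the bucket name and paths are hypothetical placeholders:

```shell
# 1. Copy input from S3 into the cluster's transient HDFS.
hadoop distcp s3a://mybucket/input hdfs:///workflow/input

# 2. Run the analytics job against HDFS, where data locality applies.
#    (The job submission command depends on your application.)

# 3. Copy the results back out to S3, where they persist after
#    the cluster is torn down.
hadoop distcp hdfs:///workflow/output s3a://mybucket/output
```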
  
- = History =
+ == History ==
   * The S3 block filesystem was introduced in Hadoop 0.10.0 ([[http://issues.apache.org/jira/browse/HADOOP-574|HADOOP-574]]).
   * The S3 native filesystem was introduced in Hadoop 0.18.0 ([[http://issues.apache.org/jira/browse/HADOOP-930|HADOOP-930]])
and rename support was added in Hadoop 0.19.0 ([[https://issues.apache.org/jira/browse/HADOOP-3361|HADOOP-3361]]).
   * The S3A filesystem was introduced in Hadoop 2.6.0. Some issues were found and fixed in later Hadoop versions ([[https://issues.apache.org/jira/browse/HADOOP-11571|HADOOP-11571]]).
  
  
- = Configuring and using the S3 filesystem support =
+ == Working with S3 from Apache Hadoop ==
  
- Consult the [[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md|Latest
Hadoop documentation]] for the specifics on using any of the S3 clients.
+ Consult the [[http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html|Latest
Hadoop documentation]] for the specifics on using any of the S3 clients.
  
  
- = Important: you cannot use S3 as a replacement for HDFS =
+ === Important: you cannot use S3 as a replacement for HDFS ===
  
  You cannot use any of the S3 filesystem clients as a drop-in replacement for HDFS. Amazon
S3 is an "object store" with
   * eventual consistency: changes made by one application (creation, updates and deletions)
will not be visible until some undefined time.
@@ -40, +52 @@

  S3 is not a filesystem. The Hadoop S3 filesystem bindings make it pretend to be a filesystem,
but it is not. It can
  act as a source of data and as a destination, though in the latter case you must remember that the output may not be immediately visible.
  
- = Security =
+ === Security ===
  
  Your Amazon Secret Access Key is exactly that: secret. If it becomes known, you have to go to the [[https://portal.aws.amazon.com/gp/aws/securityCredentials|Security Credentials]] page and revoke it. Try to avoid printing it in logs, and never check XML configuration files containing it into revision control.
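If you do keep credentials in a configuration file, a typical core-site.xml fragment for the s3a client looks like the following (illustrative only; the placeholder values must be replaced, and any file holding real keys must stay out of revision control):

```xml
<!-- Illustrative fragment: fs.s3a.access.key and fs.s3a.secret.key
     are the property names used by the s3a client. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```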
  
- = Running bulk copies in and out of S3 =
+ === Running bulk copies in and out of S3 ===
  
  Support for the S3 block filesystem was added to the `${HADOOP_HOME}/bin/hadoop distcp`
tool in Hadoop 0.11.0 (See [[https://issues.apache.org/jira/browse/HADOOP-862|HADOOP-862]]).
 The `distcp` tool sets up a MapReduce job to run the copy.  Using `distcp`, a cluster of
many members can copy lots of data quickly.  The number of map tasks is calculated by counting
the number of files in the source: i.e. each map task is responsible for copying one file.
 Source and target may refer to disparate filesystem types.  For example, source might refer
to the local filesystem or `hdfs` with `S3` as the target.
  
@@ -54, +66 @@

  
  
  {{{
- % ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998
s3://123:456@nutch/
+ % ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998
s3a://123:456@nutch/
  }}}
  
  Flip the arguments if you want to run the copy in the opposite direction.
  
  Other schemes supported by `distcp` include `file:` (for local), and `http:`.
  
+ == S3 Support in Amazon EMR ==
+ 
+ Amazon's EMR Service is based upon Apache Hadoop, but contains modifications and its own proprietary S3 client. Consult [[http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html|Amazon's documentation on this]]. Because their code is proprietary, only Amazon can provide support for, or field bug reports related to, their S3 client.
+ 
