hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "HCFS" by SteveLoughran
Date Sun, 28 Apr 2013 04:58:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "HCFS" page has been changed by SteveLoughran:

made more on hadoop FS compatibility

- '''Hadoop Compatible File Systems (HCFS)'''
+ '''Hadoop Filesystem Compatibility'''
- Hadoop Core provides a plugin architecture that allows one to configure Hadoop to use a
particular FileSystem via a plugin created specifically for that FileSystem. Hadoop FileSystem
plugin implementations must extend the abstract org.apache.hadoop.fs.FileSystem Class which
provides a set of operations that the FileSystem must implement. This ensures that there are
a base set of Hadoop FileSystem operations that all Hadoop Compatible FileSystems implement
and thus the core underlying FileSystem can be changed without affecting Hadoop Applications
written for Hadoop Clients such as MapReduce and HBase.
+ Apache Hadoop is built on a distributed filesystem, HDFS, capable of storing tens of Petabytes
of data. This filesystem is designed to work with Hadoop from the ground up, with location
aware block placement, integration with the Hadoop tools and both explicit and implicit testing.

- The Hadoop Distributed File System (HDFS) is the most prolifically configured File System
with Hadoop Core and is enabled via the org.apache.hadoop.hdfs.DistributedFileSystem plugin.
HDFS is somewhat different to what you get with most plugins in that it provides not only
the source code and implementation for the plugin, but also for the entire FileSystem itself.
Another example of a popular FileSystem plugin is the S3 Plugin which exposes the Amazon S3
Object Store Service as a Hadoop Compatible File System.
+ Hadoop also works with other filesystems: the platform-specific "local" filesystem, [[BlobStore|Blobstores]]
such as Amazon S3 and Azure storage, as well as alternative distributed filesystems.
- In some cases, the semantics of the Hadoop FileSystem operations can be ambiguous. The community
is presently attempting to [[https://issues.apache.org/jira/browse/HADOOP-9371| define the
Semantics of the Hadoop FileSystem more rigorously]] as well as adding [[https://issues.apache.org/jira/browse/HADOOP-9258|
better test coverage for Hadoop Compatible File Systems]]
+ All such filesystems (including HDFS) must link up to Hadoop in one of two ways.
- In addition, work is being done by members of the community to support Hadoop Compatible
FileSystems within the [[http://incubator.apache.org/ambari/| Ambari project to deploy, configure
and manage Hadoop ]]
+  1. The filesystem looks like a "native" filesystem, and is accessed as a local FS, perhaps
with some filesystem-specific means of telling the MapReduce layer which TaskTracker is closest
to the data.
+  1. The filesystem provides an implementation of the `org.apache.hadoop.fs.FileSystem` class
(and in Hadoop v2, an implementation of the `FileContext` class)
- ''What are some examples? ''
+ Implementing the `FileSystem` class ensures that there is an API that applications such as
MapReduce, Apache HBase, Apache Giraph and others can use, including third-party applications
as well as code running in a MapReduce job that wishes to read or write data.
- LocalFileSystem, S3FileSystem and KosmosFileSystem are all HCFS plugins that ship with Hadoop
and are available under src/core/org/apache/hadoop/fs/
+ The selection of which filesystem to use comes from the URI scheme used to refer to it:
the prefix `hdfs:` on any file path means that it refers to an HDFS filesystem, `file:` to the
local filesystem, `s3:` to Amazon S3, `ftp:` to FTP, etc.
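
As a rough illustration of scheme-based selection, the sketch below extracts the scheme from a path and looks up a class name under a key of the form `fs.<scheme>.impl`, the naming pattern Hadoop's configuration uses for this purpose. It uses only the Java standard library; the `SchemeLookup` class, its methods and the sample paths are hypothetical and are not part of Hadoop.

```java
import java.net.URI;
import java.util.Properties;

public class SchemeLookup {
    // Hypothetical stand-in for a Hadoop configuration.
    // Keys follow the fs.<scheme>.impl naming pattern.
    static Properties sampleConfig() {
        Properties conf = new Properties();
        conf.setProperty("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        conf.setProperty("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");
        return conf;
    }

    // Returns the class name registered for the path's scheme,
    // or null when the path has no scheme or the scheme is not registered.
    static String implementationFor(String path, Properties conf) {
        String scheme = URI.create(path).getScheme();
        if (scheme == null) {
            return null; // a real client would fall back to its default filesystem
        }
        return conf.getProperty("fs." + scheme + ".impl");
    }

    public static void main(String[] args) {
        Properties conf = sampleConfig();
        System.out.println(implementationFor("hdfs://namenode:8020/user/alice/data.txt", conf));
        System.out.println(implementationFor("file:///tmp/data.txt", conf));
        // No entry is registered for this scheme in the sample configuration.
        System.out.println(implementationFor("fat32:///data.txt", conf));
    }
}
```

A scheme with no registered implementation yields no class, which is why a third-party filesystem has to supply both its JAR and the matching configuration entry.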
- Additionally, the list below includes additional 3rd Party HCFS plugins to enable additional
FileSystems for Hadoop.
+ There are other filesystems that provide explicit integration with Hadoop through the relevant
Java JAR files, native binaries and configuration parameters needed to add a new scheme to
Hadoop, such as `fat32:`.
- [[http://www.datastax.com/dev/blog/cassandra-file-system-design | CassandraFS]]
+ All providers of filesystem plugins do their utmost to make their filesystems compatible
with Hadoop. Ambiguities in the Hadoop APIs do not help here, as many of the expectations
of Hadoop applications are set not by the FileSystem API but by the behavior of HDFS itself,
which makes it harder to distinguish "bug" from "feature" in the behavior of HDFS.
- [[http://www.symantec.com/enterprise-solution-for-hadoop | Symtantec Veritas Cluster File
+ We are (as of April 2013) attempting to [[https://issues.apache.org/jira/browse/HADOOP-9371|define
the Semantics of the Hadoop FileSystem more rigorously]] as well as adding [[https://issues.apache.org/jira/browse/HADOOP-9258|better
test coverage for the filesystem APIs]]. This will ensure that we can keep the filesystem
implementations that ship with Hadoop -HDFS itself, and those classes that connect to other
filesystems, currently `s3:`, `s3n:`, `file:`, `ftp:`, `webhdfs:`- consistent with each other,
and compatible with existing applications. 
- [[https://github.com/gluster/hadoop-glusterfs | GlusterFS]]
+ This formalisation of the API will also benefit anyone who wishes to provide a library
that lets Hadoop applications work with their FileSystem -such people have been very constructive
in helping define the FileSystem APIs more rigorously. 
- [[https://issues.apache.org/jira/browse/HADOOP-8545| The Hadoop FileSystem Implementation
for OpenStack Swift]]
- [[ http://answers.mapr.com/questions/116/is-mapr-wire-compatible-or-api-compatible-with-hadoop-0202
| MapR FileSystem]]
+ Here are some 3rd Party plugins to enable additional FileSystems for Hadoop /*Alphabetical
order, no endorsements, please*/.
+  * [[http://www.datastax.com/dev/blog/cassandra-file-system-design | CassandraFS]]
+  * [[https://github.com/gluster/hadoop-glusterfs | GlusterFS]]
+  * [[http://answers.mapr.com/questions/116/is-mapr-wire-compatible-or-api-compatible-with-hadoop-0202
| MapR FileSystem]]
+  * [[http://www.symantec.com/enterprise-solution-for-hadoop | Symantec Veritas Cluster
File System]]
+ Even if the filesystem is supported by a library for tight integration with Hadoop, it may
behave differently from what Hadoop and applications expect: this is something to explore
with the supplier of the filesystem.
+ What the ASF can do is warn that our own BlobStore filesystems (currently `s3:` and `s3n:`)
are not complete replacements for `hdfs:`, as operations such as `rename()` are only emulated
through copy-then-delete operations, and so a directory rename is not atomic. Atomic rename is
a requirement of POSIX filesystems which some applications (MapReduce) currently depend on.
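
To make the non-atomicity concrete, here is a minimal sketch that emulates a directory rename by copying every entry and then deleting the originals, using `java.nio.file` on a temporary local directory. It is an illustration only and is not the code of the `s3:` or `s3n:` clients; the `EmulatedRename` class and its method names are hypothetical. An observer callback runs after every step and counts the moments at which both the source and the destination directory exist, a state that a single atomic rename would never expose.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class EmulatedRename {
    // Emulates a directory rename: copy every entry, then delete the originals.
    // The observer runs after each individual step to expose intermediate state.
    static void renameByCopy(Path src, Path dst, Runnable observer) throws IOException {
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(src)) {
            // Sorted so that a parent directory always precedes its children.
            entries = walk.sorted().collect(Collectors.toCollection(ArrayList::new));
        }
        for (Path p : entries) {
            Path target = dst.resolve(src.relativize(p).toString());
            if (Files.isDirectory(p)) {
                Files.createDirectories(target);
            } else {
                Files.copy(p, target);
            }
            observer.run();
        }
        // Delete in reverse order so that children go before their parents.
        entries.sort(Comparator.reverseOrder());
        for (Path p : entries) {
            Files.delete(p);
            observer.run();
        }
    }

    // Runs the emulated rename on a temporary directory and returns how many
    // observations saw both the source and the destination directory present.
    static int countIntermediateStates() {
        try {
            Path root = Files.createTempDirectory("rename-demo");
            Path src = Files.createDirectory(root.resolve("src"));
            Files.write(src.resolve("part-0"), "a".getBytes());
            Files.write(src.resolve("part-1"), "b".getBytes());
            Path dst = root.resolve("dst");

            int[] both = {0};
            renameByCopy(src, dst, () -> {
                if (Files.exists(src) && Files.exists(dst)) {
                    both[0]++;
                }
            });
            if (Files.exists(src)) {
                throw new IllegalStateException("source should be gone after the rename");
            }
            return both[0];
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("observations with both directories present: "
                + countIntermediateStates());
    }
}
```

Any reader that lists the destination while the copy is in progress sees a partial directory, and a failure midway leaves data split across both locations. This is the situation that applications relying on atomic rename do not expect.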
+ Similarly, the local `file:` filesystem behaves differently on different operating systems,
especially regarding filename case and whether or not you can delete open files. If your intent
is to write code that only ever works with the local filesystem, always test on the target
