hadoop-hdfs-dev mailing list archives

From Jeff Dost <jd...@ucsd.edu>
Subject Feature request to provide DFSInputStream subclassing mechanism
Date Wed, 07 Aug 2013 17:59:48 GMT
Hello,

We work in a software development team at the UCSD CMS Tier2 Center.
We would like to propose a mechanism that allows one to subclass
DFSInputStream in a clean way from an external package.  First I'd like
to give some motivation for why, and then proceed with the details.

We maintain a 3-petabyte Hadoop cluster for the CMS experiment at the
CERN LHC.  Other T2 centers worldwide host mirrors of the same data.
We are working on an extension to Hadoop so that, when a file is read
and no replicas of a block are available, an external interface is used
to retrieve that block of the file from another data center.  The
external interface is necessary because not all T2 centers involved in
CMS run a Hadoop cluster as their storage backend.

In order to implement this functionality, we need to subclass
DFSInputStream and override its read method, so that we can catch
IOExceptions that occur on client reads at the block level.
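
As a rough sketch, assuming our patch makes the DFSInputStream
constructor visible to subclasses (it is package-private in stock
hadoop-2.0.0-cdh4.1.1), and with fetchBlockExternally() as a
hypothetical stand-in for the cross-site retrieval code:

   import java.io.IOException;

   import org.apache.hadoop.hdfs.BlockMissingException;
   import org.apache.hadoop.hdfs.DFSClient;
   import org.apache.hadoop.hdfs.DFSInputStream;

   public class FooFSInputStream extends DFSInputStream {

     public FooFSInputStream(DFSClient dfsClient, String src,
         int buffersize, boolean verifyChecksum) throws IOException {
       super(dfsClient, src, buffersize, verifyChecksum);
     }

     // Catch block-level failures instead of letting the read fail.
     @Override
     public synchronized int read(byte[] buf, int off, int len)
         throws IOException {
       try {
         return super.read(buf, off, len);
       } catch (BlockMissingException e) {
         // No replica of this block is reachable in our cluster, so
         // fetch the requested byte range from another data center.
         return fetchBlockExternally(buf, off, len);
       }
     }

     // Hypothetical hook: contact the external interface, fill buf,
     // and return the number of bytes read.
     private int fetchBlockExternally(byte[] buf, int off, int len)
         throws IOException {
       throw new IOException("external fetch not implemented here");
     }
   }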

The basic steps required:
1. Invent a new URI scheme for the customized "FileSystem" and register
it in core-site.xml:
   <property>
     <name>fs.foofs.impl</name>
     <value>my.package.FooFileSystem</value>
     <description>My Extended FileSystem for foofs: uris.</description>
   </property>

2. Write new classes in the external package that subclass the
following (see the sketch after this list):
FooFileSystem subclasses DistributedFileSystem
FooFSClient subclasses DFSClient
FooFSInputStream subclasses DFSInputStream
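
In rough outline, the wiring could look like the following (one class
per source file in practice; this assumes our patch makes the dfs field
of DistributedFileSystem and the relevant DFSClient constructor and
open() method accessible to subclasses, with signatures as in the
hadoop-2.0.0-cdh4.1.1 code base):

   import java.io.IOException;
   import java.net.URI;

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.hdfs.DFSClient;
   import org.apache.hadoop.hdfs.DFSInputStream;
   import org.apache.hadoop.hdfs.DistributedFileSystem;

   public class FooFileSystem extends DistributedFileSystem {
     @Override
     public void initialize(URI uri, Configuration conf)
         throws IOException {
       super.initialize(uri, conf);
       // For brevity we replace the stock client created by the
       // superclass; a cleaner patch might expose a factory method.
       this.dfs = new FooFSClient(uri, conf, statistics);
     }
   }

   public class FooFSClient extends DFSClient {
     public FooFSClient(URI nameNodeUri, Configuration conf,
         FileSystem.Statistics stats) throws IOException {
       super(nameNodeUri, conf, stats);
     }

     // Hand back the subclassed stream so all reads go through
     // FooFSInputStream.read() shown above.
     @Override
     public DFSInputStream open(String src, int buffersize,
         boolean verifyChecksum) throws IOException {
       return new FooFSInputStream(this, src, buffersize,
           verifyChecksum);
     }
   }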

Now any client command that explicitly uses the foofs:// scheme in
paths to access the Hadoop cluster can open files with a customized
InputStream that extends the functionality of the default Hadoop client
DFSInputStream.  In order to make this happen for our use case, we had
to change some access modifiers in the DistributedFileSystem, DFSClient,
and DFSInputStream classes provided by Hadoop.  In addition, we had to
comment out the check in the namenode code that only allows URI schemes
of the form "hdfs://".
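
For example (host, port, and path here are hypothetical):

   hadoop fs -cat foofs://namenode.example.org:8020/store/data.root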

Attached is the patch file we apply to Hadoop.  Note that we derived
this patch by modifying the Cloudera release hadoop-2.0.0-cdh4.1.1,
which can be found at:
http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.1.tar.gz

We would greatly appreciate any advice on whether or not this approach
sounds reasonable, and whether you would consider accepting these
modifications into the official Hadoop code base.

Thank you,
Jeff, Alja & Matevz
UCSD Physics
