hadoop-hdfs-dev mailing list archives

From Joe Bounour <jboun...@ddn.com>
Subject Re: Feature request to provide DFSInputStream subclassing mechanism
Date Wed, 07 Aug 2013 18:11:20 GMT
Hello Jeff

Is this something that could go under the HCFS project?
(I might be wrong?)


On 8/7/13 10:59 AM, "Jeff Dost" <jdost@ucsd.edu> wrote:

>We work in a software development team at the UCSD CMS Tier2 Center.  We
>would like to propose a mechanism that allows one to subclass
>DFSInputStream in a clean way from an external package.  First I'd like
>to give some motivation on why, and then I will proceed with the details.
>
>We have a 3-petabyte Hadoop cluster that we maintain for the LHC
>experiment at CERN.  There are other T2 centers worldwide that host
>mirrors of the same data.  We are working on an extension to Hadoop so
>that, when a file is being read and no replicas of a block are
>available, an external interface is used to retrieve that block from
>another data center.  The external interface is necessary because not
>all T2 centers involved in CMS run a Hadoop cluster as their storage
>backend.
>
>In order to implement this functionality, we need to subclass
>DFSInputStream and override the read method, so that we can catch
>IOExceptions that occur on client reads at the block level.
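>
>Below is a minimal sketch of the kind of override we have in mind.  The
>class name and the external-fetch helper are illustrative only, and it
>assumes the relaxed access modifiers described below:
>
>  import java.io.IOException;
>  import org.apache.hadoop.hdfs.DFSInputStream;
>
>  public class FooFSInputStream extends DFSInputStream {
>      // Constructor omitted: it would simply delegate to the
>      // DFSInputStream constructor, which is package-private in stock
>      // Hadoop (hence the access modifier changes described below).
>
>      @Override
>      public synchronized int read(byte[] buf, int off, int len)
>              throws IOException {
>          try {
>              return super.read(buf, off, len);
>          } catch (IOException e) {
>              // No replica of the block could be read locally; fetch
>              // the block from another data center through the
>              // external interface instead (hypothetical helper).
>              return readFromExternalInterface(buf, off, len);
>          }
>      }
>
>      private int readFromExternalInterface(byte[] buf, int off, int len)
>              throws IOException {
>          // ... contact the remote site and fill buf ...
>          throw new UnsupportedOperationException("sketch only");
>      }
>  }
>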
>The basic steps required:
>1. Invent a new URI scheme for the customized "FileSystem" in the
>   Hadoop configuration (typically core-site.xml):
>   <property>
>     <name>fs.foofs.impl</name>
>     <value>my.package.FooFileSystem</value>
>     <description>My Extended FileSystem for foofs: uris.</description>
>   </property>
>2. Write new classes in the external package that subclass the
>   corresponding Hadoop classes (a rough sketch follows this list):
>   FooFileSystem subclasses DistributedFileSystem
>   FooFSClient subclasses DFSClient
>   FooFSInputStream subclasses DFSInputStream
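>
>As a rough sketch of how the outer layer chains downward (the names are
>ours, and it assumes each layer's creation point can be overridden once
>the access modifiers are relaxed):
>
>  import java.io.IOException;
>  import java.net.URI;
>  import org.apache.hadoop.conf.Configuration;
>  import org.apache.hadoop.hdfs.DistributedFileSystem;
>
>  public class FooFileSystem extends DistributedFileSystem {
>      @Override
>      public void initialize(URI uri, Configuration conf)
>              throws IOException {
>          super.initialize(uri, conf);
>          // Replace the DFSClient built by super.initialize() with a
>          // FooFSClient, so that open() hands back FooFSInputStreams.
>          // In stock Hadoop the client field is not visible to
>          // subclasses, which is one of the modifier changes we made.
>      }
>  }
>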
>Now any client commands that explicitly use the foofs:// scheme in paths
>to access the Hadoop cluster can open files with a customized
>InputStream that extends the functionality of the default Hadoop client
>DFSInputStream.  In order to make this happen for our use case, we had
>to change some access modifiers in the DistributedFileSystem, DFSClient,
>and DFSInputStream classes provided by Hadoop.  In addition, we had to
>comment out the check in the NameNode code that only allows URI schemes
>of the form "hdfs://".
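>
>For example, a client could then read through the custom stream with
>ordinary FileSystem calls (the host and path below are placeholders):
>
>  import java.net.URI;
>  import org.apache.hadoop.conf.Configuration;
>  import org.apache.hadoop.fs.FSDataInputStream;
>  import org.apache.hadoop.fs.FileSystem;
>  import org.apache.hadoop.fs.Path;
>
>  Configuration conf = new Configuration();
>  FileSystem fs = FileSystem.get(URI.create("foofs://namenode:8020/"), conf);
>  FSDataInputStream in = fs.open(new Path("/store/data/example.root"));
>  // Reads now fall back to the external interface whenever a block
>  // has no available replicas.
>  in.close();
>  fs.close();
>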
>Attached is a patch file we apply to Hadoop.  Note that we derived this
>patch by modifying the Cloudera release hadoop-2.0.0-cdh4.1.1, which can
>be found at:
>
>We would greatly appreciate any advice on whether or not this approach
>sounds reasonable, and whether you would consider accepting these
>modifications into the official Hadoop code base.
>Thank you,
>Jeff, Alja & Matevz
>UCSD Physics
