hadoop-common-issues mailing list archives

From "Sanjay Radia (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4952) Improved files system interface for the application writer.
Date Thu, 03 Sep 2009 01:59:33 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750779#action_12750779 ]

Sanjay Radia commented on HADOOP-4952:

No one has commented on my proposal on the config issue in this jira.  As a result, over the
last 2 days, I have had a set of discussions with a number of folks at Yahoo, including Doug
and with Dhruba.  Here is roughly the set of opinions:
- Most felt that our config management is a mess and confusing.
- Everyone likes the notion of Server-side defaults esp when you consider federated clusters
and a URI based file namespace as explained in this Jira.
- Some folks were confused about the URI filesystem and how the FileContext lets us deal with
URIs in a first class way. But in the end most felt that it was a good idea. The unix and
scp analogy helped get this across.
- All agreed that most folks will use the SS defaults most of the time. But there are apps
that will specify, for example, the blockSize to override the SS default. They liked that
the create() call had a parameter to do that.
- There were a couple folks who felt strongly that one needs to be able to specify the bytesPerChecksum
on the client side (see the related HDFS-578); strongly enough to -1 a proposal that did not
allow it. Some felt that we should add an additional parameter to the create call while others
felt that we should add an options parameter to the create call.
- There needs to be an undocumented way to override the SS defaults so that one could test
new parameters for SS defaults without reconfiguring the clusters. (Dhruba's suggestion)

Based on the feedback, a proposal is described below. Note that for some folks parts of this proposal
represent a compromise, but they could live with it. The 0.21 deadline is very close and
we need to get this in or we will miss it.

FileContext contains the following items derived from the config:
* Default fs - /
* Working dir (derived indirectly via the default file system - details are below)
* Umask.

One creates FileContext as described in the patch (the patch is not up to date with the proposal
in this comment):
* fc = FileContext.getFC() 
* fc = FileContext.getFC(defaultFsUri), etc. 

*NO other config parameters are read from the config*: The fs client side config contains
only two things: your / and your umask; all defaults will come from SS. However, users will
be able to override these defaults through the options parameter in the create() call when
creating a file. So in this proposal there is no way to set application defaults in the config.
(Note: we may end up having some undocumented config variables to handle the SS override for
testing purposes (Dhruba's request); the exact mechanism is to be determined and I will file a separate
jira for discussing it.)

So the basic calls are:
- fc.mkdirs(path, perms)
- fc.create(path, perms, createOpt ...)  // note the use of varArgs
- fc.open(path, bufSize)

Examples of create using varargs:
    fc.create(path, perms) // all SS
    fc.create(path, perms, CreateOpt.blocksize(4096), CreateOpt.repFac(4));

Roughly: CreateOpt is a class with several subclasses, one per option (Blocksize, RepFactor
etc) and a static factory method for each of them such as CreateOpt.blocksize(long).
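To make the varargs shape concrete, here is a minimal, self-contained sketch. It is only an illustration of the idea, not the code in the patch: the class names BlockSize and RepFac, the outer CreateOptSketch class, and the string returned by create() are all assumptions, and perms is left out to keep it short.

```java
// Hypothetical sketch of the varargs design; these are not the classes from the patch.
final class CreateOptSketch {

    /** Base type for create options: one subclass per option. */
    static abstract class CreateOpt {
        // Static factories, in the style of CreateOpt.blocksize(long) from the proposal.
        static CreateOpt blocksize(long bytes) { return new BlockSize(bytes); }
        static CreateOpt repFac(int replicas) { return new RepFac(replicas); }
    }

    static final class BlockSize extends CreateOpt {
        final long bytes;
        BlockSize(long bytes) {
            if (bytes <= 0) throw new IllegalArgumentException("blockSize must be positive");
            this.bytes = bytes;
        }
    }

    static final class RepFac extends CreateOpt {
        final int replicas;
        RepFac(int replicas) {
            if (replicas <= 0) throw new IllegalArgumentException("repFac must be positive");
            this.replicas = replicas;
        }
    }

    /**
     * Stand-in for fc.create(path, perms, createOpt...). It only reports which
     * values the caller set; unset options are marked "SS" (server-side default).
     */
    static String create(String path, CreateOpt... opts) {
        Long blockSize = null;
        Integer repFac = null;
        for (CreateOpt opt : opts) {
            if (opt instanceof BlockSize) {
                blockSize = ((BlockSize) opt).bytes;
            } else if (opt instanceof RepFac) {
                repFac = ((RepFac) opt).replicas;
            } else {
                throw new IllegalArgumentException("unknown option: " + opt);
            }
        }
        return path
            + " blockSize=" + (blockSize == null ? "SS" : blockSize.toString())
            + " repFac=" + (repFac == null ? "SS" : repFac.toString());
    }

    public static void main(String[] args) {
        System.out.println(create("/tmp/a"));
        System.out.println(create("/tmp/b", CreateOpt.blocksize(4096), CreateOpt.repFac(4)));
    }
}
```

The point of the design is that a caller who is happy with the SS defaults passes nothing extra, and adding a new option later means adding a subclass and a factory rather than changing the create() signature.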

Here is the list of options that one will be able to set through the createOptions: 
- progressable - default is null => progress not reported
** (i.e. a spec default, not an SS default)
** Shall we remove progressable?
- iobufferSize    // The rest of the createOptions use SS default if not set
- replicationFactor
- blockSize - must be a multiple of bytesPerChecksum and writePacketSize
- bytesPerChecksum 

The following SS variable is *not* settable via the createOption.
- writePacketSize  - the SS default is always used. 
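The rules in the two lists above could be applied roughly as follows. This is a sketch under assumptions, not the proposed implementation: the ServerDefaults holder, the resolve() method and its array return value are hypothetical, and only blockSize and bytesPerChecksum are modelled.

```java
// Hypothetical sketch; option names follow the comment, class and method names are made up.
final class CreateParamsSketch {

    /** Server-side (SS) defaults, as the target file system would report them. */
    static final class ServerDefaults {
        final long blockSize;
        final int bytesPerChecksum;
        final int writePacketSize;

        ServerDefaults(long blockSize, int bytesPerChecksum, int writePacketSize) {
            this.blockSize = blockSize;
            this.bytesPerChecksum = bytesPerChecksum;
            this.writePacketSize = writePacketSize;
        }
    }

    /**
     * Merges client overrides (null means "not set") with the SS defaults.
     * Returns {blockSize, bytesPerChecksum, writePacketSize}.
     */
    static long[] resolve(ServerDefaults ss, Long blockSize, Integer bytesPerChecksum) {
        long bs = blockSize != null ? blockSize : ss.blockSize;
        int bpc = bytesPerChecksum != null ? bytesPerChecksum : ss.bytesPerChecksum;
        int wps = ss.writePacketSize; // never overridden by the client

        if (bs <= 0 || bpc <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // blockSize must be a multiple of both bytesPerChecksum and writePacketSize.
        if (bs % bpc != 0 || bs % wps != 0) {
            throw new IllegalArgumentException(
                "blockSize " + bs + " is not a multiple of bytesPerChecksum " + bpc
                + " and writePacketSize " + wps);
        }
        return new long[] {bs, bpc, wps};
    }
}
```

Note that with a fixed writePacketSize, a client-chosen blockSize smaller than the packet size would always be rejected under this reading of the rule.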

If the application desires a particular property it will set it in the createOpt parameters.
There is *no automatic support* to read these app defaults from a config file; *this was deliberate*.

The actual mechanism for createOpts is still to be determined but I am strongly leaning towards
varargs rather than an options object with setters and getters.

So please comment on this proposal ASAP. The above proposal was derived after looking at several
alternatives and lots of discussions; thanks to all those who participated.

Some details on how the wd and home dirs are derived:
The wd is derived from the default fs; e.g. if the defaultFS is localFS, the wd of the process
is used to initialize the wd. So HDFS could have an SS default for its wd which would be set
to the user's home directory in that cluster. Similarly the homedir is derived from the defaultFS
using server side config. (Note we could have the homedir set on the client side by config
vars but I like the way we currently do this for the local filesystem and it would be consistent
to derive it from the SS; hence the home dir in a cluster becomes a property of the cluster's
deployment. This also means fewer client side config variables.)
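A rough sketch of the wd derivation described above, assuming the server-side home directory has already been fetched from the default file system. The method name and parameters are hypothetical, and the "file" scheme check stands in for "the defaultFS is localFS".

```java
// Hypothetical sketch; not taken from the patch.
final class WorkingDirSketch {

    /**
     * Picks the initial working directory for a FileContext-like object.
     * For the local file system the process wd is used; otherwise the
     * home directory reported by the server side is used.
     */
    static String initialWorkingDir(java.net.URI defaultFs, String serverSideHomeDir) {
        if ("file".equals(defaultFs.getScheme())) {
            return System.getProperty("user.dir"); // wd of the current process
        }
        return serverSideHomeDir;
    }
}
```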

> Improved files system interface for the application writer.
> -----------------------------------------------------------
>                 Key: HADOOP-4952
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4952
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 0.21.0
>            Reporter: Sanjay Radia
>            Assignee: Sanjay Radia
>         Attachments: FileContext3.patch, FileContext5.patch, FileContext6.patch, FileContext7.patch,
Files.java, Files.java, FilesContext1.patch, FilesContext2.patch
> Currently the FileSystem interface serves two purposes:
> - an application writer's interface for using the Hadoop file system
> - a file system implementer's interface (e.g. hdfs, local file system, kfs, etc)
> This Jira proposes that we provide a simpler interface for the application writer and
leave the FileSystem interface for the implementer of a filesystem.
> - The FileSystem interface has a confusing set of methods for the application writer
> - We could make it easier to take advantage of the URI file naming
> ** Current approach is to get FileSystem instance by supplying the URI and then access
that name space. It is consistent for the FileSystem instance to not accept URIs for other
schemes, but we can do better.
> ** The special copyFromLocalFile can be generalized as a copyFile where the src or target
can be any URI, including the local one.
> ** The proposed scheme (below) simplifies this.
> -	The client side config can be simplified. 
> ** New config() by default uses the default config. Since this is the common usage pattern,
one should not need to always pass the config as a parameter when accessing the file system.
> -	
> ** It does not handle multiple file systems too well. Today a site.xml is derived from
a single Hadoop cluster. This does not make sense for multiple Hadoop clusters which may have
different defaults.
> ** Further one should need very little to configure the client side:
> *** Default file system.
> *** Block size 
> *** Replication factor
> *** Scheme to class mapping
> ** It should be possible to take block size and replication factor defaults from the
target file system, rather than the client side config.  I am not suggesting we don't allow
setting client side defaults, but most clients do not care and would find it simpler to take
the defaults for their systems from the target file system.

