hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12202) Provide new set of FileSystem API to bypass external attribute provider
Date Wed, 09 Aug 2017 17:30:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120313#comment-16120313
] 

Yongjun Zhang commented on HDFS-12202:
--------------------------------------

Thanks for the feedback, [~asuresh], [~manojg] and [~chris.douglas]. Sorry for my delayed reply;
I was out for some time.

[~asuresh]:
{quote}
but I feel we should maybe explore ways around having to modify the HDFS API by configuring
the External provider to return the underlying Attributes (and possibly bypass permission
checks) for just a white-listed set of users (and/or a configured set of name-spaces) - this
implies that performing distcp (without copying over the externally over-laid attributes)
might be restricted to only a few users of the cluster - but from a practical standpoint,
I think it should be reasonable, since I believe that for most clusters, this cluster-to-cluster
copying does not happen very often and I usually performed by an cluster admin / manager.

{quote}
This is an interesting idea! I wish there were a solution that avoids modifying the HDFS API.
For a white-listed set of users, however, I would assume that some of the calls issued by
these users need to access the external attributes and some don't, and we don't have a way to
distinguish between them. For example, the same user may run applications other than distcp.
How would we solve this with the white-list solution?

In addition, ANY user can run distcp, and distcp can also happen within a single cluster. If
we want to support these cases, would the white-list be too restrictive? Thoughts?
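
To make the concern above concrete, here is a minimal sketch of a provider that decides purely on the caller's identity. Every name in it (WhitelistProviderSketch, groupFor, the user "etl", the group strings) is hypothetical and chosen for illustration only; it is not the INodeAttributeProvider API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names exist in Hadoop.
final class WhitelistProviderSketch {
    private final Set<String> bypassUsers;

    private WhitelistProviderSketch(Set<String> bypassUsers) {
        this.bypassUsers = bypassUsers;
    }

    // Build a provider that bypasses the external attributes for the given users.
    static WhitelistProviderSketch of(String... users) {
        return new WhitelistProviderSketch(new HashSet<>(Arrays.asList(users)));
    }

    // The decision depends on the caller's identity alone; the provider has no
    // way to tell which application (distcp or something else) issued the call.
    String groupFor(String user, String hdfsGroup, String externalGroup) {
        return bypassUsers.contains(user) ? hdfsGroup : externalGroup;
    }
}

class WhitelistSketchDemo {
    public static void main(String[] args) {
        WhitelistProviderSketch provider = WhitelistProviderSketch.of("etl");
        // A distcp job run by "etl" gets the underlying HDFS group, as intended.
        System.out.println(provider.groupFor("etl", "hdfs-group", "ext-group"));
        // Any other application run by "etl" gets the same answer, even if it
        // actually needs the externally provided group.
        System.out.println(provider.groupFor("etl", "hdfs-group", "ext-group"));
        // A user outside the white-list always sees the external group.
        System.out.println(provider.groupFor("alice", "hdfs-group", "ext-group"));
    }
}
```

With only the user name as input, the two calls made by "etl" cannot be told apart, which is the gap described above.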

[~manojg]:
{quote}
 HDFS-12203 on the similar lines for requesting the external provider to run in pass through
mode
{quote}
We need to know when to pass through. I think in your case we are talking about letting snapshotDiff
pass through, which is feasible because snapshotDiff runs on the NN side and has access to
the external attribute API. But for the distcp issue we are talking about here, we have the problem
of not knowing when to pass through, because only the user knows when to pass through, and we
don't have a way to bridge the gap between the user (accessing the FileSystem API only) and the
pass-through API of the external attribute provider. The solution proposed in this jira is one possible
way to fill the gap.

[~chris.douglas]:
{quote}
This is a pretty narrow use case.
{quote}
Agreed. It applies only when an external attribute provider is enabled, and in the context of distcp.

{quote}
Extending FileSystem is a very hard sell, since it would also add this flag to the protocol.
{quote}
Agreed, the change is wide, and I wish there were a simpler way.

{quote}
 Not only would this approach not work for old clusters, it would silently return the unfiltered
results. 
{quote}
Agree that any solution here would require a change on the old cluster.

{quote}
Moreover, every FileSystem other than HDFS wouldn't support this. Are there other use cases?
{quote}
Right now, distcp with an external attribute provider enabled is the only use case I can see.
Since distcp is a client that only accesses the FileSystem API, I proposed extending the API.
My thinking was that, for FileSystems that don't care about this, the bypass parameter is simply
ignored.
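
A rough sketch of that idea follows. It uses hypothetical stand-in types (SketchStatus, SketchFileSystem, PlainFs, ProviderBackedFs) rather than the real org.apache.hadoop.fs classes so that it compiles on its own, and it omits the IOException declarations of the real API for brevity. Only the parameter name bypassExtAttrProvider comes from the proposal in this jira.

```java
// Hypothetical stand-ins for FileStatus and FileSystem; not the Hadoop classes.
final class SketchStatus {
    final String path;
    final String group;

    SketchStatus(String path, String group) {
        this.path = path;
        this.group = group;
    }
}

abstract class SketchFileSystem {
    // Existing-style call.
    abstract SketchStatus getFileStatus(String path);

    // Proposed-style overload. The base implementation ignores the flag, so a
    // file system with no external attribute provider needs no changes.
    SketchStatus getFileStatus(String path, boolean bypassExtAttrProvider) {
        return getFileStatus(path);
    }
}

// A file system that does not care about the flag: it inherits the default.
final class PlainFs extends SketchFileSystem {
    @Override
    SketchStatus getFileStatus(String path) {
        return new SketchStatus(path, "native-group");
    }
}

// An HDFS-like file system that overrides the overload to honor the flag.
final class ProviderBackedFs extends SketchFileSystem {
    @Override
    SketchStatus getFileStatus(String path) {
        return getFileStatus(path, false);
    }

    @Override
    SketchStatus getFileStatus(String path, boolean bypassExtAttrProvider) {
        // Assumed behavior: the provider's group overlays the HDFS group unless bypassed.
        String group = bypassExtAttrProvider ? "hdfs-group" : "provider-group";
        return new SketchStatus(path, group);
    }
}

class BypassSketchDemo {
    public static void main(String[] args) {
        SketchFileSystem plain = new PlainFs();
        SketchFileSystem hdfsLike = new ProviderBackedFs();
        // The flag has no effect on a file system that ignores it.
        System.out.println(plain.getFileStatus("/a", true).group);
        // The HDFS-like file system returns different groups depending on the flag.
        System.out.println(hdfsLike.getFileStatus("/a", false).group);
        System.out.println(hdfsLike.getFileStatus("/a", true).group);
    }
}
```

A distcp-like client would pass true for the flag on every call and would not need to know which kind of file system it is talking to.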

Thanks again, and hope to hear your further thoughts...

> Provide new set of FileSystem API to bypass external attribute provider
> -----------------------------------------------------------------------
>
>                 Key: HDFS-12202
>                 URL: https://issues.apache.org/jira/browse/HDFS-12202
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: hdfs, hdfs-client
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>
> HDFS client uses 
> {code}
>   /**
>    * Return a file status object that represents the path.
>    * @param f The path we want information from
>    * @return a FileStatus object
>    * @throws FileNotFoundException when the path does not exist
>    * @throws IOException see specific implementation
>    */
>   public abstract FileStatus getFileStatus(Path f) throws IOException;
>   /**
>    * List the statuses of the files/directories in the given path if the path is
>    * a directory.
>    * <p>
>    * Does not guarantee to return the List of files/directories status in a
>    * sorted order.
>    * <p>
>    * Will not return null. Expect IOException upon access error.
>    * @param f given path
>    * @return the statuses of the files/directories in the given path
>    * @throws FileNotFoundException when the path does not exist
>    * @throws IOException see specific implementation
>    */
>   public abstract FileStatus[] listStatus(Path f) throws FileNotFoundException,
>                                                          IOException;
> {code}
> to get FileStatus of files.
> When an external attribute provider (INodeAttributeProvider) is enabled for a cluster, the
> external attribute provider is consulted to get some relevant info (including ACL, group,
> etc.), which is returned in FileStatus.
> There is a problem here: when we use distcp to copy files from srcCluster to tgtCluster,
> if srcCluster has an external attribute provider enabled, the data we copy would contain data
> from the attribute provider, which we may not want.
> This jira is created to add a new set of interfaces for distcp to use, so that distcp can
> copy HDFS data only and bypass external attribute provider data.
> The new set of APIs would look like
> {code}
>   /**
>    * Return a file status object that represents the path.
>    * @param f The path we want information from
>    * @param bypassExtAttrProvider if true, bypass external attr provider
>    *        when it's in use.
>    * @return a FileStatus object
>    * @throws FileNotFoundException when the path does not exist
>    * @throws IOException see specific implementation
>    */
>   public FileStatus getFileStatus(Path f,
>       final boolean bypassExtAttrProvider) throws IOException;
>   /**
>    * List the statuses of the files/directories in the given path if the path is
>    * a directory.
>    * <p>
>    * Does not guarantee to return the List of files/directories status in a
>    * sorted order.
>    * <p>
>    * Will not return null. Expect IOException upon access error.
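>    * @param f given path
>    * @param bypassExtAttrProvider if true, bypass external attr provider
>    *        when it's in use.
>    * @return the statuses of the files/directories in the given path
>    * @throws FileNotFoundException when the path does not exist
>    * @throws IOException see specific implementation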
>    */
>   public FileStatus[] listStatus(Path f,
>       final boolean bypassExtAttrProvider) throws FileNotFoundException,
>                                                   IOException;
> {code}
> So when bypassExtAttrProvider is true, the external attribute provider will be bypassed.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
