hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-7878) API - expose an unique file identifier
Date Fri, 13 Mar 2015 23:24:38 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361303#comment-14361303 ]

Colin Patrick McCabe edited comment on HDFS-7878 at 3/13/15 11:24 PM:
----------------------------------------------------------------------

bq. Jing wrote: Could you please add more details here? Note that the getFileId API in the
current patch only calls getFileStatus and returns the inode id field contained in the HdfsFileStatus.
Or you mean the client is making both calls separately? Then why the subclass approach can
solve this?

My point is that if the client makes two different calls to getFileStatus, the file status
could change in between, so we could end up with the ID of one file and the other details
of another file.  It is also clearly inefficient, since we're making twice as many RPCs to
the NameNode as we need to.  And since the NN is the hardest part of HDFS to scale (it hasn't
been scaled horizontally), that is another concern.

bq. If you call getFileStatus and open currently, you can have the same problem - status from
one file, open from different file.

Sure, and we ought to fix this too, by making it possible for the client to get a {{FileStatus}}
from a {{DFSInputStream}}.  It would be as easy as having a method inside {{DFSInputStream}}
that called {{open(/.reserved/.inodes/<inode-id-of-file>)}}.
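As a sketch of why an ID-based path sidesteps the overwrite race, the path construction itself is trivial. The helper class below is a simplified stand-in for illustration, not the actual HDFS client API:

```java
// HDFS exposes every inode under /.reserved/.inodes/<inode-id>. A path
// built this way keeps naming the same inode even if the original path
// is later renamed or overwritten, so opening through it cannot race
// with a path-based change. InodePaths is a hypothetical helper class.
class InodePaths {
    static final String RESERVED_INODES_PREFIX = "/.reserved/.inodes/";

    // Build the ID-based path for a given inode ID.
    static String reservedPath(long inodeId) {
        return RESERVED_INODES_PREFIX + inodeId;
    }
}
```

A {{DFSInputStream}} already knows which inode it opened, so a status accessor on the stream could resolve through a path like this and be guaranteed to describe the same file the stream is reading.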

bq. Sergey wrote: ID allows to overcome this by getting ID first, then using ID-based path.
Of course if ID is obtained separately there's no guarantee but there's no way to overcome
this.

It seems like there is a very easy way to overcome this: just add an abstract function inside
{{FileStatus}} that either throws {{OperationNotSupported}} or returns the inode ID.  Then
FileStatus objects returned from HDFS (and any other filesystem that has user-visible inode
IDs) can return the inode ID, and the default implementation can throw {{OperationNotSupported}}.
 We do half the RPCs of the current patch, put half the load on the NN, and don't open up another
race condition.
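To make the shape concrete, here is a minimal sketch using simplified stand-in classes. The method name {{getFileId}} is an assumption, {{UnsupportedOperationException}} stands in for {{OperationNotSupported}}, and the real change would live on {{org.apache.hadoop.fs.FileStatus}}:

```java
// Sketch of the proposal: FileStatus gains a method that returns the
// inode ID where the filesystem supports one, and throws otherwise.
// These class names mirror the HDFS types but are simplified stand-ins.
class FileStatus {
    // Default: filesystems without user-visible inode IDs throw.
    public long getFileId() {
        throw new UnsupportedOperationException(
            "this filesystem does not expose file IDs");
    }
}

class HdfsFileStatus extends FileStatus {
    private final long inodeId;

    HdfsFileStatus(long inodeId) {
        this.inodeId = inodeId;
    }

    // HDFS already carries the inode ID in HdfsFileStatus, so the
    // override just returns it -- no extra RPC to the NameNode.
    @Override
    public long getFileId() {
        return inodeId;
    }
}
```

With this, a single {{getFileStatus}} call atomically yields both the inode ID and the rest of the attributes, so there is no window for the file to change between two reads.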

What do you think?



> API - expose an unique file identifier
> --------------------------------------
>
>                 Key: HDFS-7878
>                 URL: https://issues.apache.org/jira/browse/HDFS-7878
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.patch
>
>
> See HDFS-487.
> Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates.
> INode ID for the file should be easy to expose; alternatively the ID could be derived from block IDs, to account for appends...
> This is useful e.g. for a cache keyed by file, to make sure the cache stays correct when the file is overwritten.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
