hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6984) In Hadoop 3, make FileStatus serialize itself via protobuf
Date Thu, 04 May 2017 17:59:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997143#comment-15997143

Chris Douglas commented on HDFS-6984:

Thanks for taking a look, Andrew.

bq.  Looks like this latest patch has grown some new functionality.
That's not intended. It includes [~stevel@apache.org]'s request that the user-facing types
be {{Serializable}}, and your request to move the flags out of {{FsPermissionExtension}},
deprecating Writable APIs in/around FileStatus.

Versions before v009 tried to avoid a set of flags for features in HdfsFileStatus and FileStatus,
but ACLs/encryption rely on the FsPermissionExtension bits, not the PB record. I tried variants
that used a placeholder, "remote" stub in the PB field that's sufficient to fill out the FileStatus
API, but there are too many required fields in the PB messages to make that efficient or clean.
ACLs, encryption, and EC are all handled slightly differently. Instead of adding to client
impl complexity, v009 gives up and adds a set of flags.

By new functionality, are you referring to anything beyond making HdfsFileStatus a subtype
of FileStatus?

bq. I'd like to keep setting the bits server-side. There's no guarantee whether 2.9.0 or 3.0.0
will GA first, and if we have the option of avoiding incompatibility, I'd like to take it.
Agreed. I hope we eventually remove this, or at least stop adding new flags to FsPermission.

bq. Could you give a pointer to this? These bits are new to the JSON code, so it'd be good
to catch any bugs quickly.
is the relevant line. It's consistent with the documentation- that _is_ the FsPermission\[Extension\]
for that path- but including the flag bits in the AclStatus permission field looks more like
an accident of the implementation, not part of its intended API.

bq. I'm worried about making HdfsFileStatus extend FileStatus, particularly the link target.
HdfsFileStatus is used on the NameNode, and there is potentially some mangling from turning
the link target String into a Path. The target is supposed to be an opaque value within the
filesystem, left to be interpreted by the client.
For this particular case, shouldn't the caller be using {{HdfsFileStatus::getSymlinkInBytes}}
in contexts where it's opaque? I share your concern, but extending FileStatus also required
touching code that didn't expect (spurious) IOExceptions. In those and other existing contexts,
I didn't find any instances where the String wasn't converted to a Path. Converting the byte[]
to a String may also mangle it somewhat.

To treat it as opaque for the patch, I'll change the serialization to use the {{getSymlinkInBytes}}
conversion, rather than {{Path::toString}}. I'll also go through its uses, and try to find
all the cases that don't immediately convert the String to a Path.

bq. How bad is it to split this change out?
I've split this twice already, to fix concerns over DistCp and for {{Serializable}}. We can
change the scope of the JIRA, but (IMHO) these changes are related, and implied by previous
review requests.

bq. Wasn't clear what we gain by having HdfsFileStatus extend FileStatus
The case [~sershe] raised for HDFS-7878 involved passing a handle across process boundaries.
If a PathHandle is required to store sufficient state to open the file, then it needs to copy
(and serialize) data from HdfsFileStatus that, in most cases, will not be used. We can construct
this on-demand, if the HdfsFileStatus is kept intact. There's also the minor inefficiency
of creating a new FileStatus instance and copying fields, but 1) that's often incurred anyway
because the Located\* types are irreconcilable and 2) in the noise.

FileStatus is exposing attributes that (AFAIK) only apply to HDFS (e.g., EC, encryption);
the types are 1:1 mappings right now. If HDFS wants to pass additional information, or encode
it differently, then it can return a different data type (perhaps with this as a field). As
it's used inside the NN, it lazily binds some fields, but I don't know of a use that's inconsistent
with the POD type. I wish FileStatus were an interface and not a concrete class, but that
ship has sailed.

If you're really uncomfortable with the change, then I can try to roll it back.

bq. HttpsFileSystem: Please do not use a wildcard import
Recently switched to Intellij; not intended. Will fix my settings.

> In Hadoop 3, make FileStatus serialize itself via protobuf
> ----------------------------------------------------------
>                 Key: HDFS-6984
>                 URL: https://issues.apache.org/jira/browse/HDFS-6984
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Colin P. McCabe
>            Assignee: Colin P. McCabe
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-6984.001.patch, HDFS-6984.002.patch, HDFS-6984.003.patch, HDFS-6984.004.patch,
HDFS-6984.005.patch, HDFS-6984.006.patch, HDFS-6984.007.patch, HDFS-6984.008.patch, HDFS-6984.009.patch,
> FileStatus was a Writable in Hadoop 2 and earlier.  Originally, we used this to serialize
it and send it over the wire.  But in Hadoop 2 and later, we have the protobuf {{HdfsFileStatusProto}}
which serves to serialize this information.  The protobuf form is preferable, since it allows
us to add new fields in a backwards-compatible way.  Another issue is that already a lot of
subclasses of FileStatus don't override the Writable methods of the superclass, breaking the
interface contract that read(status.write) should be equal to the original status.
> In Hadoop 3, we should just make FileStatus serialize itself via protobuf so that we
don't have to deal with these issues.  It's probably too late to do this in Hadoop 2, since
user code may be relying on the existing FileStatus serialization there.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

View raw message