hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14444) New implementation of ftp and sftp filesystems
Date Mon, 27 Nov 2017 18:46:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267229#comment-16267229

Steve Loughran commented on HADOOP-14444:

quick review

Overall: looks like a good quality piece of code: well, written, tests. very nice docs, and
lots of proxy support.

* doesn't like applying to trunk; I had to fiddle with the poms, but there's also a clientkeystore
file which I couldn't handle. Is that needed? Could it be generated?
* doesn't compile either. 


* why is the FTP test skipped on Windows?
* we tend to use `setup()` `teardown()` as the @Before/@after operations in filesystems. Having
standard names makes it more consistent when subclassing...and having >1 before/after method
puts you into ambiguous ordering. Fix: change the names, subclass as appropriate, calling
the superclass method as desired.
* like what you've done with the mixin to reuse all the tests, but I'd prefer a name more
unique to the FS than ContractTestBase. FTPContractTestMixin?
* Never thought about having `AbstractFSContract createContract()` raise an IOE. We could
add that to its signature (best in a separate JIRA)
* You are importing the distcp tests but not using them. What's your plan there? Get this
patch in and then add that as the next iteration?


* readme should go into src/site/org/apache/hadoop/ftpextended/index.md

Misc minor points

* need to rename AbstractFileSystem to a class which isn't used elsewhere, e.g AbstractFTPFileSystem
* use try-with-resources areound channel logic and have the implicit channel.close() do the
* There's lots of opportunities to use subclasses of IOE where it is useful to provide more
meaningful failures. 
* the style guidelines have conventions on import ordering we strive to maintain, especially
on new code
* hadoop code prefers a space after // in comments; a search & replace should fix
* org/apache/hadoop/fs/ftpextended/ftp/package-info.java should declare code as @Private+Unstable.
Even if the FS is public, there's no API coming from this module, nor stability guarantees.
* Unless it's going to leak passwords, error messages should try and include the filesystem
URI in them. Why? helps debugging when the job is working with >1 FS and all you have is
a log to go on
* When wrapping library exceptions (e.g SFTP exceptions), always include the toString() value
of the wrapped exception. It'll be the string most likely to make it to bug reports.
* core-site.xml mentions s3

h3. Migration

I don't know what the good story for migration here is. Given how much better this is than
the fairly basic ftp client there is today, we've no reason not to tell everyone to move to
it. The hard part is how to 

* this works best if the previous options still all work.

h3. Security

I'm moving to a world where we have to provide security audits of sensitive patches, which
this is. 

What's the security mechanism here? 
* Is Configuration.getPassword() used to get secrets through JCEKS files?
* I see that user:password is supported. I don't like this. I guess given its only FTP it
doesn't matter that much, but on SFTP it does
* And on the topic of SFTP, what to do there?

The docs will need a section in this, with the overall theme being telling people how to use
it securely

> New implementation of ftp and sftp filesystems
> ----------------------------------------------
>                 Key: HADOOP-14444
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14444
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs
>    Affects Versions: 2.8.0
>            Reporter: Lukas Waldmann
>            Assignee: Lukas Waldmann
>         Attachments: HADOOP-14444.10.patch, HADOOP-14444.2.patch, HADOOP-14444.3.patch,
HADOOP-14444.4.patch, HADOOP-14444.5.patch, HADOOP-14444.6.patch, HADOOP-14444.7.patch, HADOOP-14444.8.patch,
HADOOP-14444.9.patch, HADOOP-14444.patch
> Current implementation of FTP and SFTP filesystems have severe limitations and performance
issues when dealing with high number of files. Mine patch solve those issues and integrate
both filesystems such a way that most of the core functionality is common for both and therefore
simplifying the maintainability.
> The core features:
> * Support for HTTP/SOCKS proxies
> * Support for passive FTP
> * Support for explicit FTPS (SSL/TLS)
> * Support of connection pooling - new connection is not created for every single command
but reused from the pool.
> For huge number of files it shows order of magnitude performance improvement over not
pooled connections.
> * Caching of directory trees. For ftp you always need to list whole directory whenever
you ask information about particular file.
> Again for huge number of files it shows order of magnitude performance improvement over
not cached connections.
> * Support of keep alive (NOOP) messages to avoid connection drops
> * Support for Unix style or regexp wildcard glob - useful for listing a particular files
across whole directory tree
> * Support for reestablishing broken ftp data transfers - can happen surprisingly often

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

View raw message