hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14444) New implementation of ftp and sftp filesystems
Date Wed, 05 Jul 2017 10:04:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074523#comment-16074523 ]

Steve Loughran commented on HADOOP-14444:
-----------------------------------------

I am watching this, but not putting any effort into looking at the code right now. Happy that
the two of you are working together to come up with something which addresses your needs.

# You don't need to have every feature in immediately. Get one implementation up to the level where
it works slightly better than the current one, good enough to sit alongside the older version for
one release, then cut the older version once the new one is stable (s3a, wasb and ADL all have a
one-release-to-stabilise experience).
# Regarding caching, I'd go for a name like {{fs.ftp.cache.host}}, with the host value coming
last. Otherwise you get into trouble with other options in future if a hostname matches one of them.
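
A minimal sketch of the naming point above, assuming one reading of the suggestion: a fixed {{fs.ftp.cache.}} prefix with the hostname appended as the final segment. The class {{FtpCacheKeys}}, its methods, and the use of a plain {{Map}} in place of a Hadoop {{Configuration}} are all hypothetical and only there to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: per-host cache switch with the host as the last key segment. */
final class FtpCacheKeys {
    // Assumed layout: fixed prefix first, hostname last, so that a hostname
    // can never be mistaken for the name of some other fs.ftp.* option.
    static final String CACHE_PREFIX = "fs.ftp.cache.";

    private FtpCacheKeys() {
    }

    /** Builds the (hypothetical) per-host key, e.g. fs.ftp.cache.ftp.example.org. */
    static String cacheKeyFor(String host) {
        return CACHE_PREFIX + host;
    }

    /** Looks up the per-host switch, falling back to the supplied default. */
    static boolean isCacheEnabled(Map<String, String> conf, String host, boolean defaultValue) {
        String value = conf.get(cacheKeyFor(host));
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(cacheKeyFor("ftp.example.org"), "true");
        System.out.println(isCacheEnabled(conf, "ftp.example.org", false));
        System.out.println(isCacheEnabled(conf, "other.example.org", false));
    }
}
```

With the host in the middle of the key instead, a server that happened to be called, say, {{cache}} could produce a key that clashes with an unrelated option; keeping the variable part last avoids that.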

Now, a quick scan through the latest patch



h2. Build

* All settings for things like Java versions and artifact versions should be picked up from the
base hadoop-project/pom.xml; we need to manage everything in one place

h2. Tests

I like the tests; these are a key part of any new feature

* Use {{GenericTestUtils}} to work with logs; there are ongoing changes there for better SLF4J
integration & log capture. Please avoid calling the log4j API directly
* Add a test timeout rule to {{TestAbstractFTPFileSystem}}, and rename the class to {{AbstractFTPFileSystemTest}}.
* Every test suite whose name starts with Test* should be executable by yetus/jenkins without
any FTP server
* Everything with Test* can be started without any endpoint configured, right?
* Use {{ContractTestUtils}} to work with filesystems and make assertions about them (more diagnostics
on failure), especially for the {{assertPathExists()}} kind of assertion, which you can move to
for things like {{testFileExists()}}
* and use SLF4J logging, not {{System.err}}
* All assertTrue/assertFalse asserts should have a meaningful string, ideally even assertEquals.
One trick: have the toString() value of the fs provide some details on the connection, so
you can include it in the asserts. Another: pull out things like {{assertChannelConnected()}}
and have the text in one place
* {{TestConnectionPool.testGetChannelFromClosedFS}}: if the unexpected IOException is caught, make
it the inner cause of the AssertionError raised.
* There's a lot of duplication in the contract tests' {{createContract()}} calls... could that be shared somehow?
* Have some isolated tests for the cache


> New implementation of ftp and sftp filesystems
> ----------------------------------------------
>
>                 Key: HADOOP-14444
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14444
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs
>    Affects Versions: 2.8.0
>            Reporter: Lukas Waldmann
>            Assignee: Lukas Waldmann
>         Attachments: HADOOP-14444.2.patch, HADOOP-14444.3.patch, HADOOP-14444.4.patch,
HADOOP-14444.5.patch, HADOOP-14444.patch
>
>
> The current implementations of the FTP and SFTP filesystems have severe limitations and performance
issues when dealing with a high number of files. My patch solves those issues and integrates
both filesystems in such a way that most of the core functionality is common to both, which
simplifies maintenance.
> The core features:
> * Support for HTTP/SOCKS proxies
> * Support for passive FTP
> * Support of connection pooling - new connection is not created for every single command
but reused from the pool.
> For huge number of files it shows order of magnitude performance improvement over not
pooled connections.
> * Caching of directory trees. For ftp you always need to list whole directory whenever
you ask information about particular file.
> Again for huge number of files it shows order of magnitude performance improvement over
not cached connections.
> * Support of keep alive (NOOP) messages to avoid connection drops
> * Support for Unix style or regexp wildcard glob - useful for listing particular files
across a whole directory tree
> * Support for reestablishing broken ftp data transfers - can happen surprisingly often


