hadoop-common-issues mailing list archives

From "Lukas Waldmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14444) New implementation of ftp and sftp filesystems
Date Sun, 28 May 2017 10:07:04 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027771#comment-16027771

Lukas Waldmann commented on HADOOP-14444:

I don't have an exact performance test. I may try to run one, but generally, for a filesystem
with a huge number of files the new implementation is far more efficient. If for nothing else,
then because in the current implementation the download of each file requires at least 2 "connects"
- if you have several thousand files, the overhead is extraordinary. I am using the new
FS mostly for distcp, and from my experience the download of approx. 40,000 files took about 6 hours
to even start, while with the new one and pooled connections it started in around 20 minutes.

Ad 1) It can, but for another download with about 90,000 files I didn't have to fiddle with
Java memory parameters. In any case, you can always disable it.
Ad 2) Yes, but I don't think anybody sane would use FTP/SFTP on a "lively" filesystem. The main
point of those filesystems is (at least for me) use with distcp. Again, as I said, you can disable
caching, and there is a provision to figure out whether a file was changed during download.
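The combination described above - cache a directory listing once, then detect mid-download changes by comparing snapshots of a file's metadata - can be sketched roughly as follows. The class names (`DirCache`, `CachedStat`) are illustrative only and are not the actual classes from the patch:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative snapshot of one remote file's metadata.
class CachedStat {
    final long size;
    final long mtime;
    CachedStat(long size, long mtime) { this.size = size; this.mtime = mtime; }
}

// One LIST per directory fills many entries; later stat() calls are
// local map hits instead of round-trips to the FTP/SFTP server.
class DirCache {
    private final Map<String, Map<String, CachedStat>> dirs = new HashMap<>();

    // Record one entry from a directory listing.
    void put(String dir, String name, long size, long mtime) {
        dirs.computeIfAbsent(dir, d -> new HashMap<>())
            .put(name, new CachedStat(size, mtime));
    }

    // Look up cached metadata; null means the file was not in the listing.
    CachedStat stat(String dir, String name) {
        Map<String, CachedStat> listing = dirs.get(dir);
        return (listing == null) ? null : listing.get(name);
    }

    // Detect that a file changed during download by comparing the snapshot
    // taken at download start against a fresh one taken afterwards.
    static boolean changed(CachedStat before, CachedStat after) {
        return before.mtime != after.mtime || before.size != after.size;
    }
}
```

The trade-off the comment hints at: the cache holds the whole tree in memory (hence the question about Java memory parameters), so disabling it remains the escape hatch for very large or frequently changing trees.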

> New implementation of ftp and sftp filesystems
> ----------------------------------------------
>                 Key: HADOOP-14444
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14444
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs
>    Affects Versions: 2.8.0
>            Reporter: Lukas Waldmann
>            Assignee: Lukas Waldmann
>         Attachments: HADOOP-14444.patch
> The current implementations of the FTP and SFTP filesystems have severe limitations and performance
issues when dealing with a high number of files. My patch solves those issues and integrates
both filesystems in such a way that most of the core functionality is shared by both, which
simplifies maintenance.
> The core features:
> * Support for HTTP/SOCKS proxies
> * Support for passive FTP
> * Support for connection pooling - a new connection is not created for every single command
but is reused from the pool.
> For a huge number of files this shows an order-of-magnitude performance improvement over
non-pooled connections.
> * Caching of directory trees. With FTP you always need to list the whole directory whenever
you ask for information about a particular file.
> Again, for a huge number of files this shows an order-of-magnitude performance improvement over
non-cached connections.
> * Support for keep-alive (NOOP) messages to avoid connection drops
> * Support for Unix-style or regexp wildcard globs - useful for listing particular files
across a whole directory tree
> * Support for re-establishing broken FTP data transfers - which can happen surprisingly often
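The pooling feature in the list above is the one that removes the per-file connect overhead the comment measured. A minimal sketch of the idea, assuming a connection wrapper class (names like `FtpChannel` and `SimpleChannelPool` are hypothetical; the real patch wraps commons-net / JSch sessions):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for one FTP/SFTP connection. The constructor
// represents the expensive "connect" step; the counter lets us see
// how many connects actually happened.
class FtpChannel {
    private static int opened = 0;
    FtpChannel() { opened++; }
    static int opensSoFar() { return opened; }
}

// Minimal pool: reuse an idle channel when one exists, and only pay
// for a new connect on a pool miss.
class SimpleChannelPool {
    private final Deque<FtpChannel> idle = new ArrayDeque<>();

    synchronized FtpChannel acquire() {
        FtpChannel c = idle.poll();
        return (c != null) ? c : new FtpChannel(); // connect only on miss
    }

    synchronized void release(FtpChannel c) {
        idle.push(c);                              // keep alive for reuse
    }
}
```

With this shape, N sequential file operations cost roughly one connect instead of N (or 2N, as the comment notes for the current implementation) - which is why the 40,000-file distcp start time dropped from hours to minutes.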

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
