hadoop-common-issues mailing list archives

From "wujinhu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-15262) AliyunOSS: rename() to move files in a directory in parallel
Date Thu, 08 Mar 2018 02:48:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390620#comment-16390620 ]

wujinhu edited comment on HADOOP-15262 at 3/8/18 2:47 AM:
----------------------------------------------------------

Attached HADOOP-15262.005.patch, which sets an upper limit on the waiting list size.

With this patch applied, users can improve copy performance by increasing *fs.oss.max.copy.threads*
and *fs.oss.max.copy.tasks.per.dir* (the old version copies the files of a directory in series).
Generally, the larger *fs.oss.max.copy.threads* and *fs.oss.max.copy.tasks.per.dir* are, the
better (provided we have enough resources).
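
To illustrate how a shared thread limit and a per-directory task limit can work together, here is a minimal sketch. It is not the code from the patch: the class `ParallelRenameSketch`, the in-memory `store`, and the helpers `copyThenDelete` and `renameDirectory` are all hypothetical, and the object store is simulated with a map. The pool size plays the role of *fs.oss.max.copy.threads* and the semaphore permits play the role of *fs.oss.max.copy.tasks.per.dir*.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

public class ParallelRenameSketch {
    // Hypothetical stand-in for an object store: object key -> content.
    static final Map<String, String> store = new ConcurrentHashMap<>();

    // Object stores have no native rename, so one "rename" is modeled
    // here as a copy followed by a delete (an assumption of this sketch).
    static void copyThenDelete(String srcKey, String dstKey) {
        store.put(dstKey, store.get(srcKey));
        store.remove(srcKey);
    }

    // Moves every object under srcDir to dstDir using the shared pool,
    // keeping at most maxTasksPerDir tasks in flight for this directory.
    static void renameDirectory(String srcDir, String dstDir,
                                ExecutorService pool, int maxTasksPerDir) {
        Semaphore inFlight = new Semaphore(maxTasksPerDir);
        List<Future<?>> futures = new ArrayList<>();
        // Iterate over a snapshot of the keys, since tasks modify the map.
        for (String key : new ArrayList<>(store.keySet())) {
            if (!key.startsWith(srcDir)) {
                continue;
            }
            String dstKey = dstDir + key.substring(srcDir.length());
            // Blocks the submitting thread once the per-directory limit is hit.
            inFlight.acquireUninterruptibly();
            futures.add(pool.submit(() -> {
                try {
                    copyThenDelete(key, dstKey);
                } finally {
                    inFlight.release();
                }
            }));
        }
        // Wait for all tasks and surface the first failure, if any.
        for (Future<?> f : futures) {
            try {
                f.get();
            } catch (Exception e) {
                throw new RuntimeException("copy task failed", e);
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20; i++) {
            store.put("src/part-" + i, "data-" + i);
        }
        ExecutorService pool = Executors.newFixedThreadPool(5); // "max copy threads"
        renameDirectory("src/", "dst/", pool, 5);               // "max copy tasks per dir"
        pool.shutdown();
        System.out.println(store.keySet());
    }
}
```

In this model the pool is shared by all rename() calls, while the semaphore keeps a single large directory from occupying every thread, which is one plausible reason to have two separate settings.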

For example, if we set *fs.oss.max.copy.threads = 5* and *fs.oss.max.copy.tasks.per.dir = 5*,
the copy time will drop to about 1/5 of that of the old version of rename().
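
The 1/5 figure can be checked with a back-of-the-envelope estimate. The sketch below assumes an idealized model that is not part of the patch: every file takes the same time to copy, a single directory is being renamed, and the effective concurrency is the smaller of the two settings. The class `CopyTimeEstimate`, the method `estimateMillis`, and the sample numbers are hypothetical.

```java
public class CopyTimeEstimate {
    // Idealized estimate: files are copied in "rounds" of `concurrency`
    // files each, and every file takes perFileMillis to copy.
    static long estimateMillis(int files, long perFileMillis,
                               int threads, int tasksPerDir) {
        int concurrency = Math.max(1, Math.min(threads, tasksPerDir));
        long rounds = (files + concurrency - 1) / concurrency; // ceiling division
        return rounds * perFileMillis;
    }

    public static void main(String[] args) {
        // 1000 files at 200 ms each, copied in series (old behavior).
        System.out.println(estimateMillis(1000, 200, 1, 1)); // prints 200000
        // Same workload with both settings at 5.
        System.out.println(estimateMillis(1000, 200, 5, 5)); // prints 40000
    }
}
```

Real speedups depend on network bandwidth, OSS request limits, and file sizes, so this is only an upper bound on the improvement.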

Here is one use case that motivated this improvement.

Users train models with Spark, TensorFlow, and similar frameworks and save the model files
to OSS. The number of model files is large, so committing jobs is slow because the frameworks
call rename().

 



> AliyunOSS: rename() to move files in a directory in parallel
> ------------------------------------------------------------
>
>                 Key: HADOOP-15262
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15262
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/oss
>    Affects Versions: 3.0.0
>            Reporter: wujinhu
>            Assignee: wujinhu
>            Priority: Major
>             Fix For: 3.1.0, 2.9.1, 3.0.1
>
>         Attachments: HADOOP-15262.001.patch, HADOOP-15262.002.patch, HADOOP-15262.003.patch,
HADOOP-15262.004.patch, HADOOP-15262.005.patch
>
>
> Currently, the rename() operation renames files in series, which is slow if a directory
contains many files. We can improve this by renaming the files in parallel.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

