hadoop-common-issues mailing list archives

From "wujinhu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-15027) Improvements for Hadoop read from AliyunOSS
Date Mon, 13 Nov 2017 06:52:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249160#comment-16249160 ]

wujinhu edited comment on HADOOP-15027 at 11/13/17 6:51 AM:
------------------------------------------------------------

[~uncleGen]
Thanks for your comments.
1. I tested with the ossutil tool, and the read speed is about 10+ MB/s (I will continue to
verify this).
2 & 3.
I think it's OK for the thread pool to live in FileSystem. The thread pool is only used to
pre-fetch data from OSS, as the following code shows:

      if (item.buffer.length == 0) {
        // EOF
        item.ready.set(true);
      } else {
        this.readAheadExecutorService.execute(new AliyunOSSFileReaderTask(key, store, item));
      }
      cachedStreams.add(item);

Each item is enqueued both in the thread pool (owned by FileSystem) and in cachedStreams (each
stream has its own queue).
If one input stream is slow, it only affects its own cachedStreams and does not affect the
others.

4. I will fix the code style of these lines.
5. Yes, we can do a simple refactor if other modules have the same requirements.

I will add another patch to fix this.
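
To make the point in 2 & 3 concrete, here is a minimal, self-contained sketch of the same
arrangement: one executor shared by all streams, and one private queue per stream. It is not
the code from the patch. The names ReadAheadSketch, Item, Stream, prefetch and next are
hypothetical, a Consumer<byte[]> stands in for the OSS range read, and a CountDownLatch plays
the role of the ready flag.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

// Illustrative sketch only: every name here is hypothetical, not taken from the patch.
class ReadAheadSketch {

    /** One prefetched range; the latch opens once the buffer has been filled. */
    static final class Item {
        final byte[] buffer;
        final CountDownLatch ready = new CountDownLatch(1);

        Item(int size) {
            this.buffer = new byte[size];
        }
    }

    /** Per-stream state: a private queue of items, plus a pool that is owned elsewhere. */
    static final class Stream {
        private final ExecutorService sharedPool;
        private final Consumer<byte[]> rangeReader; // stand-in for the remote read
        private final Queue<Item> cachedItems = new ArrayDeque<>();

        Stream(ExecutorService sharedPool, Consumer<byte[]> rangeReader) {
            this.sharedPool = sharedPool;
            this.rangeReader = rangeReader;
        }

        /** Queue one read-ahead of the given size; a size of 0 marks EOF. */
        void prefetch(int size) {
            Item item = new Item(size);
            if (item.buffer.length == 0) {
                // EOF: nothing to fetch, so the item is ready immediately.
                item.ready.countDown();
            } else {
                sharedPool.execute(() -> {
                    try {
                        rangeReader.accept(item.buffer);
                    } finally {
                        // Always release the waiter, even if the read failed.
                        item.ready.countDown();
                    }
                });
            }
            cachedItems.add(item);
        }

        /** Block until this stream's oldest item is filled, then return its buffer. */
        byte[] next() {
            Item item = cachedItems.remove();
            try {
                item.ready.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for read-ahead", e);
            }
            return item.buffer;
        }
    }
}
```

A reader that calls next() waits only on items in its own queue, so a stalled fetch for one
stream does not block another stream's next(), provided the shared pool still has a free
worker to run the other stream's task.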


> Improvements for Hadoop read from AliyunOSS
> -------------------------------------------
>
>                 Key: HADOOP-15027
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15027
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/oss
>    Affects Versions: 3.0.0
>            Reporter: wujinhu
>            Assignee: wujinhu
>         Attachments: HADOOP-15027.001.patch, HADOOP-15027.002.patch
>
>
> Currently, read performance is poor when Hadoop reads from AliyunOSS. It takes about
> 1 min to read 1 GB from OSS.
> Class AliyunOSSInputStream uses a single thread to read data from AliyunOSS, so we can
> refactor it to use multi-threaded pre-read to improve this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

