hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "SammiChen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14999) AliyunOSS: provide one asynchronous multi-part based uploading mechanism
Date Thu, 08 Feb 2018 08:15:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356629#comment-16356629

SammiChen commented on HADOOP-14999:

Hi [~uncleGen],  thanks for refine the patch. Here are a few comments.

1.  AliyunOSSFileSystemStore.

{color:#660e7a}uploadPartSize {color}= conf.getLong({color:#660e7a}MULTIPART_UPLOAD_SIZE_KEY{color},
 {color:#660e7a}multipartThreshold {color}= conf.getLong({color:#660e7a}MIN_MULTIPART_UPLOAD_THRESHOLD_KEY{color},
 partSize = conf.getLong({color:#660e7a}MULTIPART_UPLOAD_SIZE_KEY{color},
 {color:#000080}if {color}(partSize < {color:#660e7a}MIN_MULTIPART_UPLOAD_PART_SIZE{color})
 partSize = {color:#660e7a}MIN_MULTIPART_UPLOAD_PART_SIZE{color};

 What's the difference usage of "uploadPartSize" and "partSize" with the same initial value?
It seems "partSize" is not used in other places.

Also please refine the multi upload related constant properties, put related property in adjacent
place. Seems "

{color:#660e7a}MULTIPART_UPLOAD_SIZE_DEFAULT{color}" should be called "{color:#660e7a}MULTIPART_UPLOAD_PART_SIZE_DEFAULT{color}". 
And "{color:#660e7a}MULTIPART_UPLOAD_SIZE {color}= {color:#0000ff}104857600{color}" is the
temp file size. Try to make the property name carries the accurate meaning.

{color:#808080}// Size of each of or multipart pieces in bytes{color}{color:#000080}public
static final {color}String {color:#660e7a}MULTIPART_UPLOAD_SIZE_KEY {color}=
 {color:#000080}public static final long {color}{color:#660e7a}MULTIPART_UPLOAD_SIZE {color}=
{color:#0000ff}104857600{color}; {color:#808080}// 100 MB{color}{color:#000080}public static
final long {color}{color:#660e7a}MULTIPART_UPLOAD_SIZE_DEFAULT {color}= {color:#0000ff}10
{color}* {color:#0000ff}1024 {color}* {color:#0000ff}1024{color};
 {color:#000080}public static final int {color}{color:#660e7a}MULTIPART_UPLOAD_PART_NUM_LIMIT
{color}= {color:#0000ff}10000{color};

{color:#808080}// Minimum size in bytes before we start a multipart uploads or copy{color}{color:#000080}public
static final {color}String {color:#660e7a}MIN_MULTIPART_UPLOAD_THRESHOLD_KEY {color}=
 {color:#000080}public static final long {color}{color:#660e7a}MIN_MULTIPART_UPLOAD_THRESHOLD_DEFAULT
 {color:#0000ff}20 {color}* {color:#0000ff}1024 {color}* {color:#0000ff}1024{color};

{color:#000080}public static final long {color}{color:#660e7a}MIN_MULTIPART_UPLOAD_PART_SIZE
{color}= {color:#0000ff}100 {color}* {color:#0000ff}1024L{color};


2. AliyunOSSUtils#createTmpFileForWrite

    Change the order of following statements,

{color:#000080}if {color}({color:#660e7a}directoryAllocator {color}== {color:#000080}null{color})
 {color:#660e7a}directoryAllocator {color}= {color:#000080}new {color}LocalDirAllocator({color:#660e7a}BUFFER_DIR_KEY{color});
 {color:#000080}if {color}(conf.get({color:#660e7a}BUFFER_DIR_KEY{color}) == {color:#000080}null{color})
 conf.set({color:#660e7a}BUFFER_DIR_KEY{color}, conf.get({color:#008000}"hadoop.tmp.dir"{color})
+ {color:#008000}"/oss"{color});

Also is "{color:#660e7a}directoryAllocator{color}" final?

3. AliyunOSSUtils#intOption,  longOption

   Precondition doesn't support "%d".  Add test case to cover the logic. Suggest change
the name to more meaning full names like getXOption. Pay attention to the code style, the

4. TestAliyunOSSBlockOutputStream.  Add random length file tests here. Only 1024 aligned
file length is not enough.

5. AliyunOSSBlockOutputStream

   {color:#808080} Asynchronous multi-part based uploading mechanism to support huge file{color}{color:#808080}*
which is larger than 5GB.{color}

Where is this 5GB threshold checked in the code?

The resources are well cleaned after close() is called. But they are not cleaned when exception
happens during the write() process.


> AliyunOSS: provide one asynchronous multi-part based uploading mechanism
> ------------------------------------------------------------------------
>                 Key: HADOOP-14999
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14999
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/oss
>    Affects Versions: 3.0.0-beta1
>            Reporter: Genmao Yu
>            Assignee: Genmao Yu
>            Priority: Major
>         Attachments: HADOOP-14999.001.patch, HADOOP-14999.002.patch, HADOOP-14999.003.patch,
HADOOP-14999.004.patch, HADOOP-14999.005.patch, HADOOP-14999.006.patch, HADOOP-14999.007.patch,
> This mechanism is designed for uploading file in parallel and asynchronously:
>  - improve the performance of uploading file to OSS server. Firstly, this mechanism splits
result to multiple small blocks and upload them in parallel. Then, getting result and uploading
blocks are asynchronous.
>  - avoid buffering too large result into local disk. To cite an extreme example, there
is a task which will output 100GB or even larger, we may need to output this 100GB to local
disk and then upload it. Sometimes, it is inefficient and limited to disk space.
> This patch reuse {{SemaphoredDelegatingExecutor}} as executor service and depends on
> Attached {{asynchronous_file_uploading.pdf}} illustrated the difference between previous
{{AliyunOSSOutputStream}} and {{AliyunOSSBlockOutputStream}}, i.e. this asynchronous multi-part
based uploading mechanism.
> 1. {{AliyunOSSOutputStream}}: we need to output the whole result to local disk before
we can upload it to OSS. This will poses two problems:
>  - if the output file is too large, it will run out of the local disk.
>  - if the output file is too large, task will wait long time to upload result to OSS
before finish, wasting much compute resource.
> 2. {{AliyunOSSBlockOutputStream}}: we cut the task output into small blocks, i.e. some
small local file, and each block will be packaged into a uploading task. These tasks will
be submitted into {{SemaphoredDelegatingExecutor}}. {{SemaphoredDelegatingExecutor}} will
upload this blocks in parallel, this will improve performance greatly.
> 3. Each task will retry 3 times to upload block to Aliyun OSS. If one of those tasks
failed, the whole file uploading will failed, and we will abort current uploading.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

View raw message