hadoop-common-issues mailing list archives

From "Thomas Demoor (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-9565) Add a Blobstore interface to add to blobstore FileSystems
Date Fri, 05 Aug 2016 14:42:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409534#comment-15409534
] 

Thomas Demoor edited comment on HADOOP-9565 at 8/5/16 2:41 PM:
---------------------------------------------------------------

Steve, the "avoid data write" point you mention is exactly why these direct output committers
(and what I did for the FileOutputCommitter) work on object stores. Multiple writers can write
to the same object concurrently. At any point, what is visible is the successfully completed
write that started last.

Regular put:
* The content length (=N) is communicated at the start of the request.
* Once N bytes have reached S3, the object becomes visible.
* If the Hadoop task aborts before writing N bytes, the upload times out and the object version
is garbage collected by S3.
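
The regular put behaviour above can be sketched with a toy in-memory model. This is only an illustration of the semantics as described in this comment, not S3 client code: `ToyObjectStore`, `PendingPut`, `begin_put` and `write` are hypothetical names, and "started last wins" is modelled with a simple counter.

```python
import itertools


class ToyObjectStore:
    """Toy model of the regular-put semantics described above (not a real S3 client)."""

    def __init__(self):
        self._visible = {}  # key -> (start order, data) that readers currently see
        self._start_order = itertools.count(1)

    def begin_put(self, key, content_length):
        # The content length N is declared up front, before any data is sent.
        return PendingPut(self, key, content_length, next(self._start_order))

    def get(self, key):
        entry = self._visible.get(key)
        return entry[1] if entry else None

    def _publish(self, key, started, data):
        # Among completed writes, the one that started last is the visible one.
        current = self._visible.get(key)
        if current is None or started > current[0]:
            self._visible[key] = (started, data)


class PendingPut:
    """One in-flight put; nothing is visible until all N bytes have arrived."""

    def __init__(self, store, key, content_length, started):
        self._store = store
        self._key = key
        self._content_length = content_length
        self._started = started
        self._received = bytearray()

    def write(self, chunk):
        if len(self._received) + len(chunk) > self._content_length:
            raise ValueError("more bytes than the declared content length")
        self._received.extend(chunk)
        if len(self._received) == self._content_length:
            self._store._publish(self._key, self._started, bytes(self._received))
```

In this model, a writer that stops before sending all N bytes never calls `_publish`, so readers never see a partial object. That is the property the direct committers rely on.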

Multipart upload:
* Requires an explicit API call to complete (or abort).
* The object becomes visible only once the complete API call has been made.
* If the Hadoop task fails, the upload remains active (s3a has purge functionality to
automatically clean these up after a certain period), but the object is NOT visible.
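
The multipart behaviour above can be sketched the same way. Again this is a toy in-memory model under stated assumptions, not the S3 API: `ToyMultipartStore` and its methods are hypothetical, and time is passed in by the caller so that the age-based purge can be shown without a real clock.

```python
import uuid


class ToyMultipartStore:
    """Toy model of the multipart-upload semantics described above (not a real S3 client)."""

    def __init__(self):
        self._visible = {}  # key -> data that readers currently see
        self._uploads = {}  # upload id -> (key, start time, {part number: bytes})

    def initiate(self, key, now):
        upload_id = uuid.uuid4().hex
        self._uploads[upload_id] = (key, now, {})
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        _, _, parts = self._uploads[upload_id]
        parts[part_number] = bytes(data)

    def complete(self, upload_id):
        # Only this explicit call makes the object visible.
        key, _, parts = self._uploads.pop(upload_id)
        self._visible[key] = b"".join(parts[n] for n in sorted(parts))

    def abort(self, upload_id):
        self._uploads.pop(upload_id, None)

    def purge(self, now, max_age):
        # Drop uploads that have been pending longer than max_age; return how many.
        stale = [uid for uid, (_, started, _) in self._uploads.items()
                 if now - started > max_age]
        for uid in stale:
            del self._uploads[uid]
        return len(stale)

    def pending_uploads(self):
        return sorted(self._uploads)

    def get(self, key):
        return self._visible.get(key)
```

A task that fails after `upload_part` but before `complete` leaves an entry in `pending_uploads()` and nothing in `get()`, which matches the third bullet: the leftover upload takes up space until it is purged, but it is never readable.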

The interesting case to think about is network partitions.









> Add a Blobstore interface to add to blobstore FileSystems
> ---------------------------------------------------------
>
>                 Key: HADOOP-9565
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9565
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs, fs/s3, fs/swift
>    Affects Versions: 2.6.0
>            Reporter: Steve Loughran
>            Assignee: Pieter Reuse
>         Attachments: HADOOP-9565-001.patch, HADOOP-9565-002.patch, HADOOP-9565-003.patch,
HADOOP-9565-004.patch, HADOOP-9565-005.patch, HADOOP-9565-006.patch, HADOOP-9565-branch-2-007.patch
>
>
> We can make explicit the fact that some {{FileSystem}} implementations are really blobstores,
with different atomicity and consistency guarantees, by adding a {{Blobstore}} interface to
them.
> This could also be a place to add a {{Copy(Path,Path)}} method, assuming that all blobstores
implement a server-side copy operation as a substitute for rename.




