hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-13786) Add S3Guard committer for zero-rename commits to consistent S3 endpoints
Date Mon, 27 Mar 2017 19:22:42 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Steve Loughran updated HADOOP-13786:
------------------------------------
    Attachment: HADOOP-13786-HADOOP-13345-021.patch

HADOOP-13786 patch 021: unifying data structures and s3 client use

This iteration is starting to move towards using the S3AFS writeOoperations, with the first
step being: unified JSON persistence across alll committers
* New serializable type, MultiplePendingCommits , containing a list of SinglePendingCommit
instances; the latter being the things the magic committer saves. 
* SinglePendingCommit adds fields/operations needed by staging committer. It's a bit ungainly
right now; view as an interim step.
* Staging committers and tests all moved over to this datatype instead of seralized java stream.
Good: debugging, security, validation logic. Bad: JSON serialization overhead.
* in the move, switches the various lists that the thread-pooled staging code buids up to
being synchronized lists. I think there may have been risk of race conditions there. 

Other changes
* default unique name option == false
* tests can handle option of unique vs non-unique filenames
* and the partition committer skips the Mapper test. Doesn't make sense.

Essentinally: the unique name algorithm doesn't work withy map tasks, as they expect a part-m-0000/
dir with children explicitly named "index" and "data". Adding unique names under index and
data breaks this.

I'm still undecided about what the best default value is here, more insight and experimentation
needed


> Add S3Guard committer for zero-rename commits to consistent S3 endpoints
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-13786
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13786
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/s3
>    Affects Versions: HADOOP-13345
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-13786-HADOOP-13345-001.patch, HADOOP-13786-HADOOP-13345-002.patch,
HADOOP-13786-HADOOP-13345-003.patch, HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch,
HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-007.patch,
HADOOP-13786-HADOOP-13345-009.patch, HADOOP-13786-HADOOP-13345-010.patch, HADOOP-13786-HADOOP-13345-011.patch,
HADOOP-13786-HADOOP-13345-012.patch, HADOOP-13786-HADOOP-13345-013.patch, HADOOP-13786-HADOOP-13345-015.patch,
HADOOP-13786-HADOOP-13345-016.patch, HADOOP-13786-HADOOP-13345-017.patch, HADOOP-13786-HADOOP-13345-018.patch,
HADOOP-13786-HADOOP-13345-019.patch, HADOOP-13786-HADOOP-13345-020.patch, HADOOP-13786-HADOOP-13345-021.patch,
s3committer-master.zip
>
>
> A goal of this code is "support O(1) commits to S3 repositories in the presence of failures".
Implement it, including whatever is needed to demonstrate the correctness of the algorithm.
(that is, assuming that s3guard provides a consistent view of the presence/absence of blobs,
show that we can commit directly).
> I consider ourselves free to expose the blobstore-ness of the s3 output streams (ie.
not visible until the close()), if we need to use that to allow us to abort commit operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Mime
View raw message