hadoop-common-issues mailing list archives

From "Steve Loughran (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-16570) S3A committers leak threads/raises OOM on job/task commit at scale
Date Thu, 19 Sep 2019 12:18:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933323#comment-16933323
] 

Steve Loughran commented on HADOOP-16570:
-----------------------------------------

h2. Plan

h3. Now

* listPendingUploads lists all the pendingset files and loads these JSON files (in separate threads)
 to produce a single list of all files to commit or abort (see the sketch below).
* commit/abort is done in the thread pool
* _SUCCESS file lists all the written files; gets used in tests to verify that (a) the number
of files is > 0 and (b) the filenames match the store state, expected values, etc.
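
To make the memory profile concrete, here's a rough sketch of this load-everything-then-commit shape. It's illustrative only: {{loadPendingSet()}}/{{commitUpload()}} and the {{PendingUpload}} record are placeholders, not the actual committer classes.

{code:java}
// Rough sketch of the *current* shape: every .pendingset file is loaded up
// front by the worker pool and flattened into one in-memory list before any
// commit starts, so memory grows with the total number of pending uploads.
// loadPendingSet()/commitUpload() are hypothetical stand-ins.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LoadAllThenCommit {

  /** Stand-in for one entry parsed out of a .pendingset JSON file. */
  record PendingUpload(String destination, String uploadId) { }

  static List<PendingUpload> loadPendingSet(String pendingSetFile) {
    // placeholder: parse the JSON file into its pending uploads
    return List.of(new PendingUpload(pendingSetFile + "/part-0000", "upload-1"));
  }

  static void commitUpload(PendingUpload u) {
    // placeholder: complete the multipart upload
  }

  public static void main(String[] args) {
    List<String> pendingSetFiles = List.of("task-1.pendingset", "task-2.pendingset");
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      // phase 1: load *everything* into a single list
      List<CompletableFuture<List<PendingUpload>>> loads = new ArrayList<>();
      for (String f : pendingSetFiles) {
        loads.add(CompletableFuture.supplyAsync(() -> loadPendingSet(f), pool));
      }
      List<PendingUpload> all = new ArrayList<>();
      loads.forEach(load -> all.addAll(load.join()));

      // phase 2: commit every entry of that (potentially huge) list
      List<CompletableFuture<Void>> commits = new ArrayList<>();
      for (PendingUpload u : all) {
        commits.add(CompletableFuture.runAsync(() -> commitUpload(u), pool));
      }
      commits.forEach(CompletableFuture::join);
    } finally {
      pool.shutdown();
    }
  }
}
{code}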

h3. Proposed

* list all .pendingset files
* hand off load and commit/abort to the threads, so the number of actively loaded files is limited
to the number of active threads (see the sketch below).
* limit the size of the _SUCCESS file to the first, say, 500 entries; a counter field will be updated to
give the final number. That will be enough for all the integration tests I know of that use the
file.
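
A rough sketch of the proposed incremental shape, again with placeholder names ({{SUCCESS_FILE_LIMIT}}, {{loadPendingSet()}}, {{commitUpload()}}) rather than the real API: each task loads one file and commits it straight away, and the _SUCCESS manifest keeps only the first N filenames plus a counter.

{code:java}
// Rough sketch of the *proposed* shape: each worker task loads one .pendingset
// file and commits its uploads immediately, so at most (pool size) files are
// deserialized at any moment. The _SUCCESS manifest keeps only the first
// SUCCESS_FILE_LIMIT filenames plus a total counter. All names are illustrative.
import java.util.List;
import java.util.Queue;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class IncrementalLoadAndCommit {

  static final int SUCCESS_FILE_LIMIT = 500;   // "first, say, 500 entries"

  record PendingUpload(String destination, String uploadId) { }

  static List<PendingUpload> loadPendingSet(String pendingSetFile) {
    return List.of(new PendingUpload(pendingSetFile + "/part-0000", "upload-1"));
  }

  static void commitUpload(PendingUpload u) { /* complete the MPU */ }

  public static void main(String[] args) {
    List<String> pendingSetFiles = List.of("task-1.pendingset", "task-2.pendingset");
    ExecutorService pool = Executors.newFixedThreadPool(4);
    Queue<String> successEntries = new ConcurrentLinkedQueue<>();
    AtomicLong committed = new AtomicLong();
    try {
      List<CompletableFuture<Void>> tasks = pendingSetFiles.stream()
          .map(f -> CompletableFuture.runAsync(() -> {
            // load and commit inside the same task: the file's contents are
            // only held in memory while this thread is working on it
            for (PendingUpload u : loadPendingSet(f)) {
              commitUpload(u);
              long n = committed.incrementAndGet();
              if (n <= SUCCESS_FILE_LIMIT) {
                successEntries.add(u.destination());
              }
            }
          }, pool))
          .toList();
      tasks.forEach(CompletableFuture::join);
      System.out.printf("committed %d uploads; _SUCCESS lists %d of them%n",
          committed.get(), successEntries.size());
    } finally {
      pool.shutdown();
    }
  }
}
{code}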

h2. Trouble spots

h3. Handling failure to load a file, i.e. rolling back the commit

* Currently, every failure to load a file surfaces before a single upload has been committed.
With incremental load and commit, that no longer holds: a load failure can happen partway through
the operation.
* we currently roll back failures during the phase where we complete the uploads by deleting
the committed files. For that we need the entire list of committed files (see the sketch below).
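
A minimal sketch of the rollback bookkeeping that incremental commit would need, with placeholder methods standing in for the real store calls: accumulate the destinations committed so far, and delete exactly those if a later load or commit fails.

{code:java}
// Hedged sketch: with incremental commit a failure can arrive after some
// uploads have already been completed, so the committer has to accumulate the
// destinations it has committed as it goes, then delete exactly those on
// failure. commitUpload()/deleteFile() are illustrative placeholders.
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class IncrementalRollback {

  record PendingUpload(String destination, String uploadId) { }

  static void commitUpload(PendingUpload u) { /* complete the MPU */ }
  static void deleteFile(String destination) { /* remove the now-visible object */ }

  static void commitAll(List<PendingUpload> uploads) {
    Queue<String> committedSoFar = new ConcurrentLinkedQueue<>();
    try {
      for (PendingUpload u : uploads) {
        commitUpload(u);
        committedSoFar.add(u.destination());   // remember what is now visible
      }
    } catch (RuntimeException e) {
      // partial failure: only the files committed so far need deleting
      committedSoFar.forEach(IncrementalRollback::deleteFile);
      throw e;
    }
  }

  public static void main(String[] args) {
    commitAll(List.of(new PendingUpload("out/part-0000", "upload-1")));
  }
}
{code}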

h3. Partition directory output

The partitioned committer uses the list of pending uploads to identify leaf directories and
apply its policy to them:

* the Fail policy must fail the commit before a single file has been written.
* the Replace policy deletes all the files in those directories.
* the Append policy doesn't care.

The only way to implement the same checks with incremental loads will be to do an initial
scan to build up the tree of leaf directories and then apply the chosen policy.

Together this implies we'll probably have to do an initial preload scan of all pending files,
at least for the partitioned committer. It's the one where we don't want to write a single
file if there are problems, and we need to build that tree up.
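
A sketch of what that preload scan could look like, with placeholder methods for the FS/store calls and an illustrative {{ConflictPolicy}} enum: derive the leaf partition directories from the pending destinations, apply fail/replace/append up front, then fall through to the incremental load-and-commit described above.

{code:java}
// Hedged sketch of the preload pass the partitioned committer would need:
// read just enough of each .pendingset file to learn the destination paths,
// derive the leaf (partition) directories, and apply the fail/replace/append
// policy before a single upload is committed. All names are illustrative.
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PartitionPreloadScan {

  enum ConflictPolicy { FAIL, REPLACE, APPEND }

  /** Stand-in: destinations listed in one .pendingset file. */
  static List<String> destinationsIn(String pendingSetFile) {
    return List.of("out/year=2019/month=09/part-0000");
  }

  static String parentDir(String path) {
    return path.substring(0, path.lastIndexOf('/'));
  }

  static boolean directoryExists(String dir) { return false; }  // placeholder
  static void deleteDirectory(String dir) { /* placeholder */ }

  public static void main(String[] args) {
    List<String> pendingSetFiles = List.of("task-1.pendingset");
    ConflictPolicy policy = ConflictPolicy.REPLACE;

    // pass 1: build the set of leaf partition directories
    Set<String> leafDirs = new TreeSet<>();
    for (String f : pendingSetFiles) {
      for (String dest : destinationsIn(f)) {
        leafDirs.add(parentDir(dest));
      }
    }

    // apply the policy up front, before any incremental load-and-commit starts
    for (String dir : leafDirs) {
      switch (policy) {
        case FAIL -> {
          if (directoryExists(dir)) {
            throw new IllegalStateException("destination partition exists: " + dir);
          }
        }
        case REPLACE -> deleteDirectory(dir);
        case APPEND -> { /* nothing to do */ }
      }
    }
    // ...then the per-file load-and-commit phase can run as in the plan above.
  }
}
{code}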

The other committers can react to failures during incremental commits more simply:

1. Abort all pending MPUs under the output directory.
2. Delete all files under the output directory.
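
A minimal sketch of that abort-then-delete cleanup, with placeholder store calls rather than the real S3A/FS API:

{code:java}
// Hedged sketch of the simpler failure path for the non-partitioned committers:
// abort every multipart upload still pending under the job's output directory,
// then delete whatever was already written there. The helper methods are
// placeholders, not real APIs.
import java.util.List;

public class SimpleFailureCleanup {

  static List<String> listPendingMultipartUploads(String outputDir) {
    return List.of();   // placeholder: query the store for in-flight MPUs
  }

  static void abortMultipartUpload(String uploadId) { /* placeholder */ }

  static void deleteRecursively(String outputDir) { /* placeholder */ }

  static void cleanupAfterFailure(String outputDir) {
    // 1. abort all pending MPUs under the output directory
    for (String uploadId : listPendingMultipartUploads(outputDir)) {
      abortMultipartUpload(uploadId);
    }
    // 2. delete everything already written under the output directory
    deleteRecursively(outputDir);
  }

  public static void main(String[] args) {
    cleanupAfterFailure("s3a://bucket/output");
  }
}
{code}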

I'll have to think about how best to restructure the code to do this, but it is possible.


> S3A committers leak threads/raises OOM on job/task commit at scale
> ------------------------------------------------------------------
>
>                 Key: HADOOP-16570
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16570
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0, 3.1.2
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>
> The fixed size ThreadPool created in AbstractS3ACommitter doesn't get cleaned up at EOL;
> as a result you leak the no. of threads set in "fs.s3a.committer.threads"
> Not visible in MR/distcp jobs, but ultimately causes OOM on Spark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

