hadoop-common-issues mailing list archives

From "Thomas Demoor (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11684) S3a to use thread pool that blocks clients
Date Thu, 01 Oct 2015 09:54:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939600#comment-14939600 ]

Thomas Demoor commented on HADOOP-11684:
----------------------------------------

S3a has 2 modes for uploading: 
* fs.s3a.fast.upload=false (default): S3AOutputStream.java
** files are buffered to local disk first; the upload to S3 is initiated only on fs.close()
** similar behaviour to s3n and other 3rd-party filesystems
** downsides: limited by local disk throughput and remaining local disk space; delayed start of upload

* fs.s3a.fast.upload=true: S3AFastOutputStream.java
** Hadoop writes are buffered in memory; once the written data exceeds a threshold, a multipart upload is initiated,
uploading multiple parts in parallel in different *threads* (as soon as the data is in memory)
** EMR probably does something similar
** in this mode, fs.s3a.multipart.size should be set to something like 64 or 128 MB, similar
to the HDFS block size
** downsides: buffers data in memory inside the JVM (~ fs.s3a.multipart.size * (fs.s3a.threads.max
+ fs.s3a.max.total.tasks + 1)); HADOOP-12387 will improve memory management
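To get a feel for the fast-mode memory estimate above, here is a minimal sketch in plain Java. It assumes the estimate is read as multipart.size * (threads.max + max.total.tasks + 1), i.e. one buffered part per upload thread, one per queued task, and one part currently being filled. The class name S3aFastUploadMemory and the thread and task counts are hypothetical and chosen for illustration; they are not Hadoop defaults or Hadoop code.

```java
// Sketch only: rough upper bound on in-JVM buffer memory with fs.s3a.fast.upload=true.
// Assumption: estimate = multipart.size * (threads.max + max.total.tasks + 1).
final class S3aFastUploadMemory {
    private S3aFastUploadMemory() {}

    /** Bytes held if every thread and every queue slot holds a full part, plus the part being filled. */
    static long estimateBytes(long multipartSizeBytes, int threadsMax, int maxTotalTasks) {
        long buffers = (long) threadsMax + maxTotalTasks + 1;
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(multipartSizeBytes, buffers);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long partSize = 64 * mb; // fs.s3a.multipart.size = 64 MB, as suggested in the comment

        // 10 threads and 5 queued tasks are illustrative values, not defaults
        System.out.println(estimateBytes(partSize, 10, 5) / mb + " MB");

        // the max.total.tasks=1000 case
        System.out.println(estimateBytes(partSize, 10, 1000) / mb + " MB");
    }
}
```

Under these assumptions the first setting comes to 1024 MB and the second to 64704 MB, which is far more than a typical task JVM heap.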

In fast mode, more threads and queued parts improve parallelism but require additional memory
buffer space. Setting max.total.tasks=1000 certainly drives the JVM out of memory here, as do applications
that write files from separate threads (with CallerRuns, not with the blocking thread pool). In
default mode, the thread pool is used by the AWS SDK TransferManager.

Indeed, the blocking threadpool is non-trivial (semaphores,...) and thus higher-risk. Is there
similar code in HDFS we could inspect / reuse?
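For discussion, the core of the semaphore idea can be sketched with java.util.concurrent alone. This is not the code from the attached patches nor the S4 class; BlockingSubmitter and runDemo are hypothetical names, and a real implementation would also have to cover Callable submission, invokeAll, shutdown semantics and so on, which is where the risk sits.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: bound the number of outstanding tasks by making submit() block the caller.
final class BlockingSubmitter {
    private final ExecutorService delegate;
    private final Semaphore permits; // one permit per running or queued task

    BlockingSubmitter(ExecutorService delegate, int maxOutstanding) {
        this.delegate = delegate;
        this.permits = new Semaphore(maxOutstanding);
    }

    Future<?> submit(Runnable task) throws InterruptedException {
        permits.acquire(); // blocks the client while maxOutstanding tasks are in flight
        Runnable wrapped = () -> {
            try {
                task.run();
            } finally {
                permits.release(); // free the slot whether the task succeeded or failed
            }
        };
        try {
            return delegate.submit(wrapped);
        } catch (RuntimeException e) {
            permits.release(); // the task never started, so give the permit back
            throw e;
        }
    }

    /** Hypothetical demo helper: runs `tasks` no-op tasks and returns how many completed. */
    static int runDemo(int tasks, int threads, int maxOutstanding) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        BlockingSubmitter submitter = new BlockingSubmitter(pool, maxOutstanding);
        AtomicInteger completed = new AtomicInteger();
        try {
            for (int i = 0; i < tasks; i++) {
                submitter.submit(completed::incrementAndGet);
            }
            pool.shutdown();
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return completed.get();
    }
}
```

With the semaphore sized to threads plus queue slots, the writer thread is throttled to the upload rate, so the number of buffered parts stays bounded without any task being rejected.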


> S3a to use thread pool that blocks clients
> ------------------------------------------
>
>                 Key: HADOOP-11684
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11684
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.7.0
>            Reporter: Thomas Demoor
>            Assignee: Thomas Demoor
>         Attachments: HADOOP-11684-001.patch, HADOOP-11684-002.patch, HADOOP-11684-003.patch
>
>
> Currently, if fs.s3a.max.total.tasks tasks are already queued and another (part) upload wants to start,
a RejectedExecutionException is thrown. 
> We should use a thread pool that blocks clients, nicely throttling them, rather than
throwing an exception, for instance something similar to https://github.com/apache/incubator-s4/blob/master/subprojects/s4-comm/src/main/java/org/apache/s4/comm/staging/BlockingThreadPoolExecutorService.java
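
The rejection described above can be reproduced with a plain java.util.concurrent.ThreadPoolExecutor. The sketch below is not the S3a code; it only assumes a bounded work queue combined with the JDK's default AbortPolicy, and the class name RejectionDemo and the pool and queue sizes of 1 are hypothetical, picked to keep the example small.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch only: a bounded queue plus the default AbortPolicy rejects work once the pool is full.
final class RejectionDemo {
    private RejectionDemo() {}

    /** Returns true if the third task is rejected by a pool with 1 thread and 1 queue slot. */
    static boolean overflowIsRejected() {
        CountDownLatch gate = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        // Each task waits on the gate, standing in for a slow part upload
        Runnable slowUpload = () -> {
            try {
                gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            pool.execute(slowUpload); // occupies the single worker thread
            pool.execute(slowUpload); // fills the single queue slot
            pool.execute(slowUpload); // no thread and no queue slot left: AbortPolicy throws
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        } finally {
            gate.countDown(); // let the accepted tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + overflowIsRejected());
    }
}
```

A blocking pool would make the third execute() call wait for a free slot instead of throwing.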



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
