flink-issues mailing list archives

From "huntercc (Jira)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-24293) Tasks from the same job on a machine share user jar
Date Wed, 15 Sep 2021 12:53:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

huntercc updated FLINK-24293:
    External issue URL:   (was: https://github.com/apache/flink/pull/17289)

> Tasks from the same job on a machine share user jar 
> ----------------------------------------------------
>                 Key: FLINK-24293
>                 URL: https://issues.apache.org/jira/browse/FLINK-24293
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-09-15-20-43-11-758.png, image-2021-09-15-20-43-17-304.png
> In the current BLOB storage design, tasks executed by the same TaskExecutor share a BLOB
storage directory, while tasks executed by different TaskExecutors use different directories. As a result,
a TaskExecutor has to download the user jar even if the same jar has already been downloaded
by other TaskExecutors on the same machine. We believe there is no need to download multiple copies
of the same user jar to one machine. Doing so exposes two main problems:
>  # The NIC bandwidth of the distribution side may become a bottleneck    !image-2021-09-15-20-43-17-304.png|width=695,height=193!
As shown in the figure above, 24640 Mbps of the total 25000 Mbps NIC bandwidth was in use when
we launched a Flink job with 4000 TaskManagers, which causes long deployment times and
Akka timeout exceptions.
>  # More disk space is consumed
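To put the bandwidth figures above in perspective, here is a small arithmetic sketch. It works out to roughly 98.56% link utilization and, under the assumption (not stated in the issue) that all 4000 TaskManagers download concurrently and share the link evenly, about 6.16 Mbps per TaskManager. The class and method names are illustrative only.

```java
import java.util.Locale;

public final class NicShare {
    /** Percentage of the link capacity currently in use. */
    public static double utilizationPercent(double usedMbps, double capacityMbps) {
        return 100.0 * usedMbps / capacityMbps;
    }

    /** Average bandwidth per downloader, assuming the link is shared evenly. */
    public static double perDownloaderMbps(double usedMbps, int downloaders) {
        return usedMbps / downloaders;
    }

    public static void main(String[] args) {
        // Figures taken from the issue: 24640 of 25000 Mbps, 4000 TaskManagers.
        System.out.printf(Locale.ROOT, "utilization: %.2f%%%n",
                utilizationPercent(24640, 25000));
        // Even sharing among concurrent downloaders is an assumption.
        System.out.printf(Locale.ROOT, "per TaskManager: %.2f Mbps%n",
                perDownloaderMbps(24640, 4000));
    }
}
```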
> We propose to optimize the sharing mechanism for user jars by allowing tasks from the same
job on a machine to share a BLOB storage directory and, more specifically, the user jar in that directory.
Only one task deployed to the machine downloads the user jar from the BLOB server or distributed
file storage, and subsequent tasks simply use the localized user jar. In this way, the user
jar of a job needs to be downloaded only once per machine. Here is a comparison of job
startup time before and after the optimization.
> ||Number of TaskManagers||Startup time before optimization||Startup time after optimization||
> |1000|62s|37s|
> |2000|104s|40s|
> |3000|170s|43s|
> |4000|211s|45s|
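The download-once idea described above can be sketched roughly as follows. This is not the implementation from the pull request and does not use Flink's BLOB APIs: `SharedJarCache`, `JarSource`, and the directory layout are hypothetical names chosen for illustration. The sketch assumes that TaskExecutor processes on one machine can see a common local directory, and it coordinates them with a file lock plus an atomic rename so that only the first caller fetches the jar.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicInteger;

public final class SharedJarCache {
    /** Hypothetical stand-in for a fetch from the BLOB server or a DFS. */
    @FunctionalInterface
    public interface JarSource {
        InputStream open() throws IOException;
    }

    // File locks are held per JVM, so threads in one JVM are serialized here.
    private static final Object JVM_LOCK = new Object();

    private final Path sharedDir;

    public SharedJarCache(Path sharedDir) {
        this.sharedDir = sharedDir;
    }

    /** Returns the local jar path, downloading it only if no one has yet. */
    public Path getOrDownload(String jobId, String jarName, JarSource source)
            throws IOException {
        Path jobDir = Files.createDirectories(sharedDir.resolve(jobId));
        Path jar = jobDir.resolve(jarName);
        Path lockFile = jobDir.resolve(jarName + ".lock");
        synchronized (JVM_LOCK) {
            // The exclusive file lock serializes other processes on the machine.
            try (FileChannel ch = FileChannel.open(lockFile,
                         StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                 FileLock lock = ch.lock()) {
                if (Files.exists(jar)) {
                    return jar; // already localized by an earlier task
                }
                Path tmp = Files.createTempFile(jobDir, jarName, ".part");
                try (InputStream in = source.open()) {
                    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                    // Publish the finished file atomically under its final name.
                    Files.move(tmp, jar, StandardCopyOption.ATOMIC_MOVE);
                } finally {
                    Files.deleteIfExists(tmp); // no-op after a successful move
                }
                return jar;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shared-blobs");
        SharedJarCache cache = new SharedJarCache(dir);
        AtomicInteger downloads = new AtomicInteger();
        JarSource source = () -> {
            downloads.incrementAndGet();
            return new ByteArrayInputStream(new byte[] {1, 2, 3});
        };
        Path first = cache.getOrDownload("job-1", "user.jar", source);
        Path second = cache.getOrDownload("job-1", "user.jar", source);
        System.out.println(first.equals(second) + " downloads=" + downloads.get());
    }
}
```

Writing to a temporary file and renaming it keeps readers from ever seeing a partially downloaded jar. Cleanup of the shared directory when the job finishes is left out of this sketch.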

This message was sent by Atlassian Jira
