spark-issues mailing list archives

From "Nezih Yigitbasi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation
Date Mon, 15 Feb 2016 23:31:18 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147896#comment-15147896 ]

Nezih Yigitbasi commented on SPARK-13328:
-----------------------------------------

Although this long read time can be reduced by lowering the values of the {{spark.shuffle.io.maxRetries}}
and {{spark.shuffle.io.retryWait}} parameters, it may not be desirable to reduce the number of
retries globally, and reducing the retry wait may increase the load on the serving block manager.


I already have a fix that adds a new config parameter, {{spark.block.failures.beforeLocationRefresh}},
which determines when to refresh the list of block locations from the driver while iterating
over these locations. In my fix this parameter is honored only when dynamic allocation is enabled,
and its default value is Int.MaxValue, so the behavior doesn't change even if dynamic allocation
is enabled (refreshing the locations may not be necessary in small clusters).
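
To illustrate the idea, here is a minimal Python sketch of a fetch loop that refreshes the location list after a configurable number of failures. It is not the actual patch or Spark's BlockManager code: {{fetch_block}}, {{get_locations}}, and {{fetch_from}} are hypothetical names, and skipping already-tried locations after a refresh is an assumption made so that the loop always terminates.

```python
# Illustrative model only; not Spark code. All names here are hypothetical.
INT_MAX = 2**31 - 1  # mirrors Scala's Int.MaxValue, the proposed default


def fetch_block(get_locations, fetch_from, failures_before_refresh=INT_MAX):
    """Try each known location of a block in order.

    get_locations: callable returning the driver's current location list.
    fetch_from: callable returning the block bytes, or None on failure.
    failures_before_refresh: failures tolerated before re-asking the driver.
    Returns (data_or_None, number_of_fetch_attempts).
    """
    locations = list(get_locations())
    tried = set()
    attempts = 0
    failures_since_refresh = 0
    i = 0
    while i < len(locations):
        loc = locations[i]
        i += 1
        data = fetch_from(loc)
        attempts += 1
        if data is not None:
            return data, attempts
        tried.add(loc)
        failures_since_refresh += 1
        if failures_since_refresh >= failures_before_refresh:
            # Ask the driver again and drop locations that already failed
            # (assumption: a failed location is not worth retrying).
            locations = [l for l in get_locations() if l not in tried]
            i = 0
            failures_since_refresh = 0
    return None, attempts
```

With the default threshold the loop never refreshes, so it walks the original (possibly stale) list exactly as before; a small threshold bounds how many stale entries are tried before the list is refreshed.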

> Poor read performance for broadcast variables with dynamic resource allocation
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-13328
>                 URL: https://issues.apache.org/jira/browse/SPARK-13328
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from removed
> executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations
> of a block before giving up. However, the locations of a block are retrieved only once from
> the driver in this process, and the locations in this list can be stale due to dynamic resource
> allocation. This situation gets worse when running on a large cluster, as the size of this
> location list can be on the order of several hundred, of which tens may be stale
> entries. What we have observed is that with the default settings of 3 max retries and 5s between
> retries (that's 15s per location), the time it takes to read a broadcast variable can be as
> high as ~17m (the log below shows the failed 70th block fetch attempt, where each attempt takes
> 15s).
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 18 took 1051049 ms
> {code}
> {code}
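
The delay estimate in the description can be checked with a short calculation. This is a simplified model that only multiplies out the figures quoted above (3 retries, 5s wait, 70 failed locations); the function name is hypothetical and the model ignores connection and transfer time.

```python
def stale_fetch_delay_seconds(stale_locations, max_retries=3, retry_wait_s=5):
    """Time spent on stale locations, assuming each one costs
    max_retries * retry_wait_s seconds before it is given up on."""
    return stale_locations * max_retries * retry_wait_s


delay_s = stale_fetch_delay_seconds(70)
print(delay_s, "seconds, or", delay_s / 60, "minutes")  # 1050 seconds, 17.5 minutes
```

The result, 1050 s, is within about a second of the 1051049 ms reported in the log.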



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
