spark-issues mailing list archives

From "Aaron Davidson (JIRA)" <>
Subject [jira] [Commented] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
Date Sat, 18 Apr 2015 03:11:58 GMT


Aaron Davidson commented on SPARK-6962:

Thanks for those log excerpts. It is likely significant that each IP appeared exactly once
in a connection exception among the executors. Given this warning, but no corresponding error
"Still have X requests outstanding when connection from is closed", I would also
be inclined to deduce that only the TransportServer side of the socket is timing out, and
that for some reason the connection exception is not reaching the client side of the socket
(where it would have caused the outstanding fetch requests to fail promptly).

If this situation could arise, then each client could be waiting indefinitely for some other
server to respond, which it never will. Does your cluster have any sort of unusual network configuration?

Even so, this could only explain why the hang is indefinite, not why all communication is
paused for 20 minutes leading up to it.

To further diagnose this, it would actually be very useful if you could turn on TRACE level
debugging for and
(this should look like {{}} in the

> Netty BlockTransferService hangs in the middle of SQL query
> -----------------------------------------------------------
>                 Key: SPARK-6962
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Jon Chase
>         Attachments: jstacks.txt
> Spark SQL queries (though this seems to be a Spark Core issue - I'm just using queries
> in the REPL to surface this, so I mention Spark SQL) hang indefinitely under certain (not
> totally understood) circumstances.
> This is resolved by setting spark.shuffle.blockTransferService=nio, which seems to point
> to netty as the issue. Netty was set as the default for the block transport layer in 1.2.0,
> which is when this issue started. Setting the service to nio allows queries to complete normally.
> I do not see this problem when running queries over smaller datasets (~20 files of ~5MB each).
> When I increase the scope to include more data (several hundred ~5MB files), the queries
> will get through several steps but eventually hang indefinitely.
> Here's the email chain regarding this issue, including stack traces:
> For context, here's the announcement regarding the block transfer service change:
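
For reference, the workaround described in the report above could be applied when launching the REPL roughly as follows. This is only a sketch: the property name and value are taken from the report, while the use of spark-shell with the --conf flag is an assumption about how the reporter starts the session. The same property could instead be placed in spark-defaults.conf.

```shell
# Assumed invocation: start the Spark REPL with the NIO-based block transfer
# service instead of the Netty default (applies to the affected versions
# listed in the report).
spark-shell --conf spark.shuffle.blockTransferService=nio
```

This only sidesteps the Netty code path; it does not explain the hang discussed in the comment.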
