flink-dev mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-1604) Livelock in PartitionRequestClientFactory
Date Tue, 24 Feb 2015 09:47:04 GMT
Till Rohrmann created FLINK-1604:

             Summary: Livelock in PartitionRequestClientFactory
                 Key: FLINK-1604
                 URL: https://issues.apache.org/jira/browse/FLINK-1604
             Project: Flink
          Issue Type: Bug
            Reporter: Till Rohrmann

After a job restart, we observed a livelock in {{PartitionRequestClientFactory.createPartitionRequestClient}}.
We suspect the following cause:

To obtain a new {{PartitionRequestClient}}, a new {{ConnectingChannel}} is created.
This channel acts as a future for the client. The channel is inserted into a {{ConcurrentHashMap}}
so that other threads trying to create a client for the same address wait on the future.
Once the client is obtained by the initially requesting thread, it is inserted into the
{{ConcurrentHashMap}} in place of the {{ConnectingChannel}}. When the client is disposed, it is
removed from the {{ConcurrentHashMap}}, but only if the client itself is still stored in the map.
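
The intended (non-failing) path described above can be sketched roughly as follows. This is only a simplified model, not Flink's actual code: {{ClientCacheSketch}}, {{Client}}, {{getOrCreate}} and {{dispose}} are hypothetical names, and a {{CompletableFuture}} stands in for the {{ConnectingChannel}}.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical, simplified model of the lookup described above; not Flink's actual code.
class ClientCacheSketch {

    /** Stand-in for PartitionRequestClient. */
    static final class Client { }

    /** The value is either a pending CompletableFuture<Client> or a ready Client. */
    final ConcurrentMap<String, Object> entries = new ConcurrentHashMap<>();

    Client getOrCreate(String address) {
        CompletableFuture<Client> connecting = new CompletableFuture<>();
        Object old = entries.putIfAbsent(address, connecting);
        if (old == null) {
            // This thread won the race: connect, then swap the future for the client.
            Client client = new Client(); // stands in for the real connection setup
            connecting.complete(client);
            entries.replace(address, connecting, client);
            return client;
        }
        if (old instanceof CompletableFuture) {
            // Another thread is still connecting: wait on its future.
            @SuppressWarnings("unchecked")
            CompletableFuture<Client> pending = (CompletableFuture<Client>) old;
            return pending.join();
        }
        return (Client) old;
    }

    /** Removes the client only if it is still the value stored for the address. */
    boolean dispose(String address, Client client) {
        return entries.remove(address, client);
    }
}
```

The two-argument {{remove}} in {{dispose}} is the important detail: it is a no-op whenever the stored value is anything other than the given client.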

This is where things can go wrong. If the requesting thread is interrupted after it has created
the {{ConnectingChannel}} and inserted it into the {{ConcurrentHashMap}}, but before it has inserted
the {{PartitionRequestClient}} into the same map, then the map entry for the given {{RemoteAddress}}
remains the {{ConnectingChannel}}. Assume now that another thread waited on this channel and eventually
obtained the client from this future. When the job is cancelled, the client is
disposed by the corresponding {{RemoteInputChannel}}. Once the job has been restarted,
new threads want to connect to the {{RemoteAddress}}, and in the map they find the {{ConnectingChannel}}
whose future result is the disposed {{PartitionRequestClient}}. They retrieve
the client from the channel and see that it has already been disposed. They then try to delete the
client from the {{ConcurrentHashMap}} to make room for a new one. However, this deletion fails
because the map still contains the {{ConnectingChannel}} rather than the client, so the threads
would retry without ever making progress.
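
The suspected sequence can be reproduced in a small, self-contained model. This is a sketch under stated assumptions, not Flink's actual code: {{LivelockSketch}}, {{Client}}, {{destroy}}, {{makesProgress}} and {{staleChannelScenario}} are hypothetical names, the retry loop is our guess at the shape of the real one, and it is bounded by {{maxAttempts}} only so that the example terminates.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified stand-ins for the classes named in the report; not Flink's actual code.
class LivelockSketch {

    /** Stand-in for PartitionRequestClient: only tracks whether it was disposed. */
    static final class Client {
        private volatile boolean disposed;
        void dispose() { disposed = true; }
        boolean isDisposed() { return disposed; }
    }

    /** Stand-in for ConnectingChannel: a future that has already produced its client. */
    static final class ConnectingChannel {
        final Client result;
        ConnectingChannel(Client result) { this.result = result; }
    }

    /** Maps a remote address to either a Client or a ConnectingChannel. */
    final ConcurrentMap<String, Object> clients = new ConcurrentHashMap<>();

    /** Conditional remove: succeeds only if the map still holds this exact client. */
    boolean destroy(String address, Client client) {
        return clients.remove(address, client);
    }

    /**
     * Hypothetical retry loop of a thread that wants a usable client after the restart.
     * Bounded by maxAttempts so the sketch terminates; returns true once the slot
     * is free or holds a live client, false if every attempt saw the same stale entry.
     */
    boolean makesProgress(String address, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Object entry = clients.get(address);
            if (entry == null) {
                return true; // slot is free, a new client could be created
            }
            Client client = (entry instanceof ConnectingChannel)
                    ? ((ConnectingChannel) entry).result
                    : (Client) entry;
            if (!client.isDisposed()) {
                return true; // usable client found
            }
            destroy(address, client); // has no effect if the entry is the channel
        }
        return false;
    }

    /** Builds the suspected bad state: the channel was never replaced by its client. */
    static LivelockSketch staleChannelScenario(String address) {
        LivelockSketch factory = new LivelockSketch();
        Client client = new Client();
        factory.clients.put(address, new ConnectingChannel(client)); // interrupted before the swap
        client.dispose();                 // job cancelled, client disposed
        factory.destroy(address, client); // returns false: the entry is the channel
        return factory;
    }

    public static void main(String[] args) {
        LivelockSketch factory = staleChannelScenario("host:6121");
        System.out.println("progress: " + factory.makesProgress("host:6121", 1000));
    }
}
```

In this model the stale {{ConnectingChannel}} is never evicted, because every removal attempt compares the stored channel against the client. If the theory holds, one possible direction for a fix would be to make the removal also cover the case where the stored value is the channel that produced the disposed client.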

That is currently our best theory for the livelock we observed on Travis.
