spark-dev mailing list archives

From Niranda Perera <niranda.per...@gmail.com>
Subject Re: Possible deadlock in registering applications in the recovery mode
Date Fri, 22 Apr 2016 06:12:14 GMT
Hi guys,

Any update on this?

Best

On Wed, Apr 20, 2016 at 3:00 AM, Niranda Perera <niranda.perera@gmail.com>
wrote:

> Hi Reynold,
>
> I have created a JIRA for this [1]. I have also created a PR for the same
> issue [2].
>
> I would be very grateful if you could look into this, because it is a
> blocker in our Spark deployment, which uses a number of custom Spark
> extensions.
>
> thanks
> best
>
> [1] https://issues.apache.org/jira/browse/SPARK-14736
> [2] https://github.com/apache/spark/pull/12506
>
> On Mon, Apr 18, 2016 at 9:02 AM, Reynold Xin <rxin@databricks.com> wrote:
>
>> I haven't looked closely at this, but I think your proposal makes sense.
>>
>>
>> On Sun, Apr 17, 2016 at 6:40 PM, Niranda Perera <niranda.perera@gmail.com>
>> wrote:
>>
>>> Hi guys,
>>>
>>> Any update on this?
>>>
>>> Best
>>>
>>> On Tue, Apr 12, 2016 at 12:46 PM, Niranda Perera <
>>> niranda.perera@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have encountered a small issue in the standalone recovery mode.
>>>>
>>>> Let's say an application A is running in the cluster. Due to some
>>>> issue, the entire cluster, together with application A, goes down.
>>>>
>>>> Later, the cluster comes back online and the master goes into the
>>>> 'recovering' mode, because the Persistence Engine tells it that some apps,
>>>> workers and drivers were already in the cluster. While recovery is in
>>>> progress, the application comes back online, but now it has a different
>>>> ID, let's say B.
>>>>
>>>> However, as per the master's application registration logic, application
>>>> B will NOT be added to 'waitingApps'; the master only logs the message
>>>> "Attempted to re-register application at same address". [1]
>>>>
>>>>   private def registerApplication(app: ApplicationInfo): Unit = {
>>>>     val appAddress = app.driver.address
>>>>     if (addressToApp.contains(appAddress)) {
>>>>       logInfo("Attempted to re-register application at same address: " + appAddress)
>>>>       return
>>>>     }
>>>>
>>>>
>>>> The problem is that the master is trying to recover application A, which
>>>> no longer exists. Therefore, after the recovery process, app A will be
>>>> dropped. However, app A's successor, app B, was also omitted from the
>>>> 'waitingApps' list, because it has the same address that app A had
>>>> previously.
>>>>
>>>> This creates a deadlock in the cluster: neither app A nor app B is
>>>> available in the cluster.
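
The failure mode described above can be reproduced with a minimal, self-contained Scala sketch. This is a toy model and not Spark's Master: `ToyMaster`, `App`, `removeApplication` and the address string are hypothetical names chosen for illustration, and only the address check mirrors the quoted `registerApplication` snippet.

```scala
import scala.collection.mutable

object RecoveryDropSketch {
  // Hypothetical stand-in for ApplicationInfo: an ID plus the driver address.
  final case class App(id: String, driverAddress: String)

  // Toy model of the master's bookkeeping (assumption: not Spark's real Master).
  final class ToyMaster {
    val addressToApp = mutable.HashMap.empty[String, App]
    val waitingApps = mutable.ArrayBuffer.empty[App]

    // Mirrors the quoted check: an app at an already-known address is ignored.
    def registerApplication(app: App): Boolean = {
      if (addressToApp.contains(app.driverAddress)) {
        false
      } else {
        addressToApp(app.driverAddress) = app
        waitingApps += app
        true
      }
    }

    // Models dropping an app that never responded during recovery.
    def removeApplication(app: App): Unit = {
      addressToApp.remove(app.driverAddress)
      waitingApps -= app
    }
  }

  // Returns whether B was accepted and which apps are left waiting.
  def demo(): (Boolean, List[App]) = {
    val master = new ToyMaster
    val a = App("A", "driver-host:1234")
    val b = App("B", "driver-host:1234") // same address, new ID
    master.registerApplication(a)                  // restored from persisted state
    val accepted = master.registerApplication(b)   // arrives during recovery
    master.removeApplication(a)                    // recovery ends, A never responded
    (accepted, master.waitingApps.toList)
  }
}
```

In this model, `RecoveryDropSketch.demo()` reports that B was rejected and that no application is left waiting once A is dropped.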
>>>>
>>>> When the master is in the RECOVERING mode, shouldn't it add all the
>>>> registering apps to a list first, and then after the recovery is completed
>>>> (once the unsuccessful recoveries are removed), deploy the apps which are
>>>> new?
>>>>
>>>> Wouldn't this resolve the deadlock?
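
The proposal above (hold registrations that arrive during recovery, then register them once the unsuccessful recoveries have been removed) could look roughly like the following sketch. It is a toy model under stated assumptions, not the change made in the PR and not Spark's Master: `ToyMaster`, `App`, `pending`, `beginRecovery` and `completeRecovery(responded)` are hypothetical names and signatures.

```scala
import scala.collection.mutable

object DeferredRegistrationSketch {
  // Hypothetical stand-in for ApplicationInfo: an ID plus the driver address.
  final case class App(id: String, driverAddress: String)

  // Toy master that defers registrations while it is recovering.
  final class ToyMaster {
    private var recovering = false
    private val addressToApp = mutable.HashMap.empty[String, App]
    private val pending = mutable.ArrayBuffer.empty[App]
    val waitingApps = mutable.ArrayBuffer.empty[App]

    // Same address check as in the quoted registerApplication snippet.
    private def registerNow(app: App): Boolean = {
      if (addressToApp.contains(app.driverAddress)) {
        false
      } else {
        addressToApp(app.driverAddress) = app
        waitingApps += app
        true
      }
    }

    // Re-register the apps read back from persisted state.
    def beginRecovery(stored: Seq[App]): Unit = {
      recovering = true
      stored.foreach(app => registerNow(app))
    }

    // While recovering, only remember the app; decide after recovery.
    def registerApplication(app: App): Unit = {
      if (recovering) {
        pending += app
      } else {
        registerNow(app)
      }
    }

    // Drop apps that did not respond, then replay deferred registrations.
    def completeRecovery(responded: Set[String]): Unit = {
      val dead = waitingApps.filterNot(a => responded.contains(a.id)).toList
      dead.foreach { a =>
        addressToApp.remove(a.driverAddress)
        waitingApps -= a
      }
      recovering = false
      pending.foreach(app => registerNow(app))
      pending.clear()
    }
  }

  // IDs left in waitingApps after A is recovered and B registers mid-recovery.
  def demo(responded: Set[String]): List[String] = {
    val master = new ToyMaster
    master.beginRecovery(Seq(App("A", "driver-host:1234")))
    master.registerApplication(App("B", "driver-host:1234"))
    master.completeRecovery(responded)
    master.waitingApps.map(_.id).toList
  }
}
```

With this ordering, `demo(Set.empty)` leaves B waiting after A is dropped, while `demo(Set("A"))` keeps A and still rejects B as a duplicate address, so the existing check is preserved for apps that really did survive.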
>>>>
>>>> Looking forward to hearing from you.
>>>>
>>>> best
>>>>
>>>> [1]
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
>>>>
>>>> --
>>>> Niranda
>>>> @n1r44 <https://twitter.com/N1R44>
>>>> +94-71-554-8430
>>>> https://pythagoreanscript.wordpress.com/

-- 
Niranda
@n1r44 <https://twitter.com/N1R44>
+94-71-554-8430
https://pythagoreanscript.wordpress.com/
