flink-user mailing list archives

From Andra Lungu <lungu.an...@gmail.com>
Subject Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)
Date Tue, 30 Jun 2015 13:54:28 GMT
Hey Till,

I managed to reproduce the bug; the logs are in the corresponding JIRA
[hopefully I got the right ones :)]: FLINK-2299
<https://issues.apache.org/jira/browse/FLINK-2299>

As a side note: guys, these two issues (FLINK-2299
<https://issues.apache.org/jira/browse/FLINK-2299> and FLINK-2293
<https://issues.apache.org/jira/browse/FLINK-2293>) are pretty much show
stoppers for my work, so I can offer support to get them
fixed, e.g. modify parameter x on the cluster, try running it again, etc.

I would appreciate it if someone could look into them and give me a
status update.

Thanks a lot! Sorry for the inconvenience!
Andra

On Tue, Jun 30, 2015 at 10:51 AM, Till Rohrmann <trohrmann@apache.org>
wrote:

> Do you have the JobManager and TaskManager logs of the corresponding TM,
> by any chance?
>
> On Mon, Jun 29, 2015 at 8:12 PM, Andra Lungu <lungu.andra@gmail.com>
> wrote:
>
>> Something similar in flink-0.10-SNAPSHOT:
>>
>> 06/29/2015 10:33:46     CHAIN Join(Join at main(TriangleCount.java:79)) -> Combine (Reduce at main(TriangleCount.java:79))(222/224) switched to FAILED
>> java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager
>>         at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
>>         at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>>         at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>>         at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
>>         at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>         at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
>>         at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
>>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>         at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>         at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>         at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> 06/29/2015 10:33:46     Job execution switched to status FAILING.
>>
>>
>> On Mon, Jun 29, 2015 at 1:08 PM, Alexander Alexandrov <
>> alexander.s.alexandrov@gmail.com> wrote:
>>
>>> I witnessed a similar issue yesterday on a simple job (single task
>>> chain, no shuffles) with a release-0.9 based fork.
>>>
>>> 2015-04-15 14:59 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>>
>>>> Yes, sorry for that. I found it somewhere in the logs. The problem was
>>>> that the program didn't die immediately but was somehow hanging, and I
>>>> discovered the source of the problem only when running the program on a
>>>> subset of the data.
>>>>
>>>> Thanks for the support,
>>>> Flavio
>>>>
>>>> On Wed, Apr 15, 2015 at 2:56 PM, Stephan Ewen <sewen@apache.org> wrote:
>>>>
>>>>> This means that the TaskManager was lost. The JobManager can no longer
>>>>> reach the TaskManager and considers all tasks executing on the TaskManager
>>>>> as failed.
>>>>>
>>>>> Have a look at the TaskManager log; it should describe why the
>>>>> TaskManager failed.
>>>>> On 15.04.2015 14:45, "Flavio Pompermaier" <pompermaier@okkam.it> wrote:
>>>>>
>>>>>> Hi to all,
>>>>>>
>>>>>> I have this strange error in my job and I don't know what's going on.
>>>>>> What can I do?
>>>>>>
>>>>>> The full exception is:
>>>>>>
>>>>>> The slot in which the task was scheduled has been killed (probably loss of TaskManager).
>>>>>> at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)
>>>>>> at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335)
>>>>>> at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319)
>>>>>> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106)
>>>>>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151)
>>>>>> at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
>>>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435)
>>>>>> at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>>>> at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>>>> at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>>>> at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
>>>>>> at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
>>>>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>>>> at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
>>>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>>>> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
>>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>>
>>>>>>
>>>>
>>>
>>
>
