flink-dev mailing list archives

From Ufuk Celebi <...@apache.org>
Subject Re: No job recovery after job manager failure
Date Thu, 17 Dec 2015 14:26:08 GMT
As an update: I’m investigating this. Ali sent me the log files.

> On 16 Dec 2015, at 18:15, Ufuk Celebi <uce@apache.org> wrote:
> 
> Hey Ali,
> 
> can you send me the complete logs?
> 
> I don’t think it’s possible via the mailing list. Just send it to my private email uce@apache.org.
> 
> – Ufuk
> 
>> On 16 Dec 2015, at 17:26, Kashmar, Ali <Ali.Kashmar@emc.com> wrote:
>> 
>> Hi,
>> 
>> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). I started a job with parallelism = 32 and waited a few seconds so that all nodes were doing work. I then shut down the node that had the leader job manager; by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and saw that ZooKeeper had elected a new leader. I also saw a log entry for recovering jobs, but nothing actually happened. Here’s the job manager log from the node that became the leader:
>> 
>> 11:06:43,448 INFO  org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
>> 11:06:45,912 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
>> 11:06:45,963 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
>> 11:06:45,975 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
>> 11:08:25,925 INFO  org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.
>> 
>> 
>> I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node:
>> 
>> 
>> 11:06:45,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
>> 11:06:45,983 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
>> 11:06:45,988 INFO  org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
>> 11:06:45,994 INFO  org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
>> 11:06:45,994 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
>> 11:06:45,995 INFO  org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>> 
>> 
>> Is this a bug?
>> 
>> Thanks,
>> Ali
> 

