flink-user-zh mailing list archives

From Han Xiao <xiao...@chinaunicom.cn>
Subject Re: Re: flink HA mode process hangs!!!
Date Tue, 26 Mar 2019 05:53:01 GMT
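Thank you very much for your answer. The cause was that ZooKeeper still held the jobGraph of a failed job, so the cluster tried to recover it on every startup. Deleting the leftover entries in ZooKeeper and restarting resolved the problem.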
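For readers who need to find the leftover entries, here is a minimal sketch of where they would usually live. It assumes the documented Flink defaults for `high-availability.zookeeper.path.root` (`/flink`), `high-availability.cluster-id` (`/default`) and `high-availability.zookeeper.path.jobgraphs` (`/jobgraphs`), none of which are overridden in the configuration posted in this thread. The helper names `jobgraphs_znode` and `cleanup_commands` are hypothetical. The printed lines are meant for `zkCli.sh`; `rmr` is the ZooKeeper 3.4 command, and 3.5+ offers `deleteall` instead.

```python
# Defaults as documented for Flink's ZooKeeper HA options (an assumption for
# any cluster that overrides them in flink-conf.yaml).
DEFAULTS = {
    "high-availability.zookeeper.path.root": "/flink",
    "high-availability.cluster-id": "/default",
    "high-availability.zookeeper.path.jobgraphs": "/jobgraphs",
}


def jobgraphs_znode(conf):
    """Join root, cluster-id and job graph sub-path into one znode path."""
    parts = []
    for key, default in DEFAULTS.items():
        value = conf.get(key, default)
        # Tolerate values written with or without a leading slash.
        parts.extend(p for p in value.split("/") if p)
    return "/" + "/".join(parts)


def cleanup_commands(conf, job_id):
    """Return zkCli.sh commands to list the job graphs and remove one entry."""
    base = jobgraphs_znode(conf)
    return [
        f"ls {base}",
        f"rmr {base}/{job_id}",  # ZooKeeper 3.4 syntax; 3.5+ has deleteall
    ]


if __name__ == "__main__":
    # The job id is the one named in the error message of this thread.
    for cmd in cleanup_commands({}, "a5ffe00b0bc5688d9a7de5c62b8150e6"):
        print(cmd)
```

Stop the Flink cluster before removing anything, and run the `ls` first to confirm that the entry really belongs to a job you no longer need.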

 
Thank you for your reply!
From: baiyg25281@hundsun.com
Sent: 2019-03-26 09:40
To: user-zh
Subject: Re: Re: flink HA mode process hangs!!!
Could this be related to this access-control setting?
high-availability.zookeeper.client.acl: open
 
 
 
baiyg25281@hundsun.com
From: Han Xiao
Sent: 2019-03-26 09:33
To: user-zh@flink.apache.org
Subject: Re: Re: flink HA mode process hangs!!!
Hi, good morning, and thank you for your reply. Here are my configuration files and parameters:
flink-conf.yaml
common:
jobmanager.rpc.address: test10
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 2
taskmanager.tmp.dirs: /app/tools/flink-1.7.2/tmp
High Availability:
high-availability: zookeeper
high-availability.storageDir: hdfs://test10:8020/flink/ha/   ## this directory is created normally, but it contains no jobGraph-related subdirectory
high-availability.zookeeper.quorum: ip1:2181,ip2:2181,ip3:2181,ip4:2181,ip5:2181
high-availability.zookeeper.client.acl: open
Fault tolerance and checkpointing:
state.backend: filesystem
state.checkpoints.dir: hdfs://test10:8020/flink-checkpoints  ## this directory was not created
Web Frontend:
rest.port: 8081
masters:              slaves:
test10:8081           test12
test11 : 8082         test13
                      test14
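That is the complete configuration. The path that the error message below tries to look up does not appear anywhere in my configuration, which puzzles me.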
Thank you for your reply!
From: Zili Chen
Sent: 2019-03-25 19:57
To: user-zh@flink.apache.org
Subject: Re: flink HA mode process hangs!!!
It looks like HDFS is being asked for the submittedJobGraph at /flink/ha/zookeeper/submittedJobGraphb05001535f91, which does not look right.
Flink HA requires both a ZooKeeper path and a file system path for storing state. Try setting high-availability.storageDir to a valid HDFS path.
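A quick way to act on this advice is to check the HA keys mechanically. The following is a minimal sketch, not a Flink tool: `parse_flat_conf` and `check_ha` are hypothetical helpers, the list of required keys is an assumption based on the options discussed in this thread, and the parser only handles flat `key: value` lines with `#` comments. The sample values are the ones posted in this thread.

```python
from urllib.parse import urlparse

# Keys discussed in this thread; treating them as "required" is an assumption.
REQUIRED_HA_KEYS = (
    "high-availability",
    "high-availability.storageDir",
    "high-availability.zookeeper.quorum",
)

SAMPLE = """\
High Availability:
high-availability: zookeeper
high-availability.storageDir: hdfs://test10:8020/flink/ha/   ## trailing note
high-availability.zookeeper.quorum: ip1:2181,ip2:2181,ip3:2181,ip4:2181,ip5:2181
"""


def parse_flat_conf(text):
    """Parse 'key: value' lines, dropping everything after the first '#'."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if ": " not in line:
            continue  # blank lines, comments, section labels such as "common:"
        key, value = line.split(": ", 1)
        conf[key.strip()] = value.strip()
    return conf


def check_ha(conf):
    """Return a list of findings; an empty list means nothing obvious is wrong."""
    findings = [f"missing key: {k}" for k in REQUIRED_HA_KEYS if not conf.get(k)]
    storage = conf.get("high-availability.storageDir", "")
    if storage and not urlparse(storage).scheme:
        # Not necessarily an error: a scheme-less path resolves against the
        # default file system, which may not be the one you expect.
        findings.append("high-availability.storageDir has no scheme such as hdfs://")
    return findings


if __name__ == "__main__":
    print(check_ha(parse_flat_conf(SAMPLE)))
```

An empty result only means the keys are present and well formed. It cannot detect stale state left behind by an earlier deployment.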
Best,
tison.
On Mon, Mar 25, 2019 at 7:53 PM, Zili Chen <wander4096@gmail.com> wrote:
> Could you share your HA configuration? In particular high-availability.storageDir; I suspect it may not be configured.
> Best,
> tison.
>
>
> On Mon, Mar 25, 2019 at 7:26 PM, Han Xiao <xiaoh20@chinaunicom.cn> wrote:
>
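>> Hello everyone. I am new to Flink and ran into a problem while deploying Flink HA; I would appreciate your help.
>> After starting Flink HA, the jobmanager process hangs immediately. I am using Flink 1.7.2. The log below contains the error "File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91".
>> What puzzles me is that neither my checkpoint directory nor my HA directory is this path, so why does Flink look there? No JobGraph has been generated under the directories I configured, yet Flink keeps looking up the job graph /a5ffe00b0bc5688d9a7de5c62b8150e6 and fails to find it.
>> I deleted all the related configured paths and set the cluster up again, but Flink still looks for it at startup. How can I stop Flink from looking up this JobGraph so that my HA cluster runs healthily?
>>
>>
>> Error log: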
>> 2019-03-25 18:55:00,742 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
>> java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>         at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>         at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
>>         at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>> .......
>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>         at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>         at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>>         at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>> ........
>> Caused by: java.io.FileNotFoundException: File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91
>>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
>>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2100)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2070)
>> .......
>> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91
>>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
>>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2100)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2070)
>> .......
>>
>> Thank you!
>>
>
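The stack trace is consistent with the way Flink's ZooKeeper HA mode splits its data: ZooKeeper keeps only a small handle per job, and the serialized JobGraph itself lives under the file system storage directory. If the file disappears while the handle stays, recovery fails at startup. The following is a toy model of that failure and of the cleanup, using plain dictionaries in place of ZooKeeper and HDFS. All names here are hypothetical and none of them are Flink APIs.

```python
class BrokenStateHandle(Exception):
    """Raised when a pointer exists but the data it points to is gone."""


def recover_job_graphs(zk_pointers, file_store):
    """Resolve every job id registered in the coordination store.

    zk_pointers: dict of job id -> file path (stands in for ZooKeeper znodes)
    file_store:  dict of file path -> payload (stands in for HDFS)
    """
    recovered = {}
    for job_id, path in zk_pointers.items():
        if path not in file_store:
            raise BrokenStateHandle(
                f"could not retrieve job graph under /{job_id}: {path} does not exist"
            )
        recovered[job_id] = file_store[path]
    return recovered


def clean_stale_pointers(zk_pointers, file_store):
    """Drop pointers whose target file is missing and return the removed ids."""
    stale = [job_id for job_id, path in zk_pointers.items() if path not in file_store]
    for job_id in stale:
        del zk_pointers[job_id]
    return stale


if __name__ == "__main__":
    # Job id and file path are taken from the error log in this thread.
    zk = {
        "a5ffe00b0bc5688d9a7de5c62b8150e6":
            "/flink/ha/zookeeper/submittedJobGraphb05001535f91",
    }
    hdfs = {}  # the referenced file no longer exists
    try:
        recover_job_graphs(zk, hdfs)
    except BrokenStateHandle as err:
        print("startup fails:", err)
    print("removed:", clean_stale_pointers(zk, hdfs))
    print("after cleanup:", recover_job_graphs(zk, hdfs))
```

In the model, re-creating the storage directories changes nothing because the stale pointer lives in the other store, which matches what the original poster observed.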