flink-user-zh mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "shao.hongxiao" <17611022...@163.com>
Subject 回复: flink on kubernetes 作业卡主现象咨询
Date Tue, 12 May 2020 07:39:54 GMT
兄弟 请问你的问题解决了吗?怎么解决的,谢谢


| |
邵红晓
|
|
邮箱:17611022895@163.com
|
签名由网易邮箱大师定制
在2020年5月8日 10:08,LakeShen<shenleifighting@gmail.com> 写道:
Hi ,

你可以看下你的内存配置情况,看看是不是内存配置太小,导致 networkd
bufffers 不够。

具体文档参考:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html

Best,
LakeShen

a511955993 <a511955993@163.com> 于2020年5月7日周四 下午9:54写道:

hi, all


集群信息:
flink版本是1.10,部署在kubernetes上,kubernetes版本为1.17.4,docker版本为19.03,
cni使用的是weave。


现象:
作业运行的时候,偶发会出现operation卡住,下游收不到数据,水位线无法更新,反压上游,作业在一段时间会被kill掉的情况。


通过jstack出来的堆栈信息片段如下:


"Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f
waiting on condition [0x00007f66b04ed000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0000000608f3c600> (a
java.util.concurrent.CompletableFuture$Signaller)
at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
at
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at org.apache.flink.runtime.io
.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)
at org.apache.flink.runtime.io
.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209)
at org.apache.flink.runtime.io
.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189)
at org.apache.flink.runtime.io
.network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103)
at org.apache.flink.runtime.io
.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145)
at org.apache.flink.runtime.io
.network.api.writer.RecordWriter.emit(RecordWriter.java:116)
at org.apache.flink.runtime.io
.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60)
at
org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
at
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
at
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)




有怀疑过是虚拟化网络问题,增加了如下参数,不见效:
taskmanager.network.request-backoff.max: 300000
akka.ask.timeout: 120s
akka.watch.heartbeat.interval: 10s


尝试过调整buffer数量,不见效:
taskmanager.network.memory.floating-buffers-per-gate: 16
taskmanager.network.memory.buffers-per-channel: 6




目前我遭遇到的使用场景说明如上,希望得到一些回复和解答说明,非常感谢。

Looking forward to your reply and help.

Best




Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message