flink-issues mailing list archives

From "Piotr Nowojski (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-7845) Netty Exception when submitting batch job repeatedly
Date Mon, 13 Nov 2017 14:06:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249627#comment-16249627 ]

Piotr Nowojski commented on FLINK-7845:
---------------------------------------

I'm 99.9% sure the IllegalAccessError is unrelated to any memory leak, and I'm investigating
it right now.

Memory usage of your test is stable for me (please check the screenshot attached to the issue).
The only problem I have seen is that after many iterations I got this error:

Caused by: java.io.IOException: Insufficient number of network buffers: required 8, but only
4 available. The total number of network buffers is currently set to 11519 of 32768 bytes
each. You can increase this number by setting the configuration keys 'taskmanager.network.memory.fraction',
'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
    at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:257)
    at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:199)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:618)
    at java.lang.Thread.run(Thread.java:748)

but it is almost certainly caused by the ever-increasing job size. Each subsequent iteration
has more and more tasks, which is clearly visible in the logs. I'm not sure whether only the
plan built between two execute() calls is executed (you can easily test it); however, look at
the following lines in your code:

{code:java}
			if (entitonTuples == null) {
				entitonTuples = dsQuads;
			} else {
				entitonTuples = entitonTuples.union(dsQuads);
			}
{code}

After the first iteration you always union the new data set with the result of all previous
iterations. I bet this is the reason for the growing job graph.
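
To illustrate why that pattern would make each submitted plan larger, here is a minimal sketch in plain Java. It is a toy model, not the Flink API: `Node`, `source()`, `size()`, `planSizes` and the `resetEachIteration` flag are hypothetical names invented for this example, and it assumes that every submitted plan contains all operators upstream of the current data set.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a lazily built dataflow plan. This is NOT the Flink API. */
public class PlanGrowthSketch {

    /** A plan node: a source has no inputs, a union has two (hypothetical type). */
    static final class Node {
        final List<Node> inputs = new ArrayList<>();

        static Node source() {
            return new Node();
        }

        Node union(Node other) {
            Node u = new Node();
            u.inputs.add(this);
            u.inputs.add(other);
            return u;
        }

        /** Counts this node plus every node upstream of it. */
        int size() {
            int n = 1;
            for (Node in : inputs) {
                n += in.size();
            }
            return n;
        }
    }

    /** Returns the number of plan nodes reachable from the data set in each iteration. */
    public static List<Integer> planSizes(int iterations, boolean resetEachIteration) {
        List<Integer> sizes = new ArrayList<>();
        Node entitonTuples = null; // same variable name as in the quoted snippet
        for (int i = 0; i < iterations; i++) {
            Node dsQuads = Node.source();
            if (entitonTuples == null || resetEachIteration) {
                entitonTuples = dsQuads;
            } else {
                // Keeps every earlier iteration upstream of the new data set.
                entitonTuples = entitonTuples.union(dsQuads);
            }
            sizes.add(entitonTuples.size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println("accumulating: " + planSizes(5, false));
        System.out.println("reset:        " + planSizes(5, true));
    }
}
```

In this model the accumulating variant grows by two nodes per iteration (one new source and one new union), while the variant that starts from a fresh data set each time stays at a constant size. Whether the real job behaves the same way still needs to be checked against the logs.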

> Netty Exception when submitting batch job repeatedly
> ----------------------------------------------------
>
>                 Key: FLINK-7845
>                 URL: https://issues.apache.org/jira/browse/FLINK-7845
>             Project: Flink
>          Issue Type: Bug
>          Components: Core, Network
>    Affects Versions: 1.3.2
>            Reporter: Flavio Pompermaier
>         Attachments: Screen Shot 2017-11-13 at 14.54.38.png
>
>
> We had some problems with Flink and Netty so we wrote a small unit test to reproduce
> the memory issues we have in production. It happens that we have to restart the Flink cluster
> because the memory is always increasing from job to job.
> The github project is https://github.com/okkam-it/flink-memory-leak and the JUnit test
> is contained in the MemoryLeakTest class (within src/main/test).
> I don't know if this is the root of our problems but at some point, usually around the
> 28th loop, the job fails with the following exception (actually we never faced that in production
> but maybe is related to the memory issue somehow...):
> {code:java}
> Caused by: java.lang.IllegalAccessError: org/apache/flink/runtime/io/network/netty/NettyMessage
> 	at io.netty.util.internal.__matchers__.org.apache.flink.runtime.io.network.netty.NettyMessageMatcher.match(NoOpTypeParameterMatcher.java)
> 	at io.netty.channel.SimpleChannelInboundHandler.acceptInboundMessage(SimpleChannelInboundHandler.java:95)
> 	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:102)
> 	... 16 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
