flink-user mailing list archives

From: Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com>
Subject: RE: Great number of jobs and numberOfBuffers
Date: Thu, 17 Aug 2017 09:53:25 GMT

We first ran into this bug in Flink 1.0.1 on YARN (maybe the YARN behavior is different?). We've
been having this issue for a long time and have been careful not to schedule too many jobs.

I'm currently upgrading the application to Flink 1.2.1 and I'd like to try to solve this issue.

I'm not submitting individual jobs to a standalone cluster.

I'm starting a single application that has a loop in its main function:
for (...) {
	ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
	env....;
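
Spelled out, the loop looks roughly like this (only a sketch: the actual sources, transformations and sinks are elided, and the 3-by-3 grouping of the datehours is just placeholder logic, not our real code):

import java.util.Arrays;
import java.util.List;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchLoop {
    public static void main(String[] args) throws Exception {
        // args: the datehours to process, grouped 3 by 3 so that each
        // submitted job stays small enough for the network buffers
        List<String> datehours = Arrays.asList(args);
        for (int i = 0; i < datehours.size(); i += 3) {
            List<String> batch = datehours.subList(i, Math.min(i + 3, datehours.size()));
            // fresh environment for every batch, then a blocking execute()
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // ... define sources / transformations / sinks for this batch on env ...
            env.execute("datehours " + batch);
        }
    }
}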

The job fails at some point later during execution with the following error:
java.io.IOException: Insufficient number of network buffers: required 96, but only 35 available.
The total number of network buffers is currently set to 36864. You can increase this number
by setting the configuration key 'taskmanager.network.numberOfBuffers'.
  at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:196)
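
For reference, that key goes in flink-conf.yaml. The current value from the error above (36864 buffers, i.e. about 1152 MB of network memory, assuming we haven't changed the default 32 KB segment size) looks like this:

taskmanager.network.numberOfBuffers: 36864
# one buffer = one memory segment (32 KB by default, taskmanager.memory.segment-size)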

Before I split the job into multiple sub-jobs, it failed right at startup.

Each "batch" job takes 10 to 30 minutes and it fails after about dozen of them (the first
ones should have had enough time to be recycled).

We've already increased the jobmanager memory and the "numberOfBuffers" value quite a bit. That way
we can handle days of data, but not weeks or months, so this doesn't really scale. And, as you
say, I'd expect those buffers to be recycled, in which case there would be no limit as
long as each batch is small enough.
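
If I read the Flink documentation right, the suggested sizing for this key is roughly:

numberOfBuffers = slots-per-TM^2 * #TMs * 4

e.g. with 8 slots per task manager and 16 task managers (illustrative numbers, not our actual setup): 8 * 8 * 16 * 4 = 4096 buffers.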

If I start my command again (removing the datehours that were already processed successfully), it
works, since it's a fresh cluster.
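
In the meantime I can try the workaround you suggest below: pausing between batches so the task managers have time to report the tasks as finished and recycle the buffers. A minimal sketch, at the end of each loop iteration (the 30-second delay is an arbitrary guess):

	env.execute("batch");
	// give the task managers a moment to report the tasks as finished
	// so the network buffers get recycled (delay is a guess)
	Thread.sleep(30_000);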

-----Original Message-----
From: Ufuk Celebi [mailto:uce@apache.org] 
Sent: Thursday, 17 August 2017 11:24
To: Ufuk Celebi <uce@apache.org>
Cc: Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com>; user@flink.apache.org; Nico
Kruber <nico@data-artisans.com>
Subject: Re: Great number of jobs and numberOfBuffers

PS: Also pulling in Nico (CC'd) who is working on the network stack.

On Thu, Aug 17, 2017 at 11:23 AM, Ufuk Celebi <uce@apache.org> wrote:
> Hey Gwenhael,
> the network buffers are recycled automatically after a job terminates.
> If this does not happen, it would be quite a major bug.
> To help debug this:
> - Which version of Flink are you using?
> - Does the job fail immediately after submission or later during execution?
> - Is the following correct: the batch job that eventually fails
> because of missing network buffers runs without problems if you submit
> it to a fresh cluster with the same memory configuration?
> The network buffers are recycled after the task managers report the 
> task being finished. If you immediately submit the next batch there is 
> a slight chance that the buffers are not recycled yet. As a possible 
> temporary work around, could you try waiting for a short amount of 
> time before submitting the next batch?
> I think we should also be able to run the job without splitting it up 
> after increasing the network memory configuration. Did you already try 
> this?
> Best,
> Ufuk
> On Thu, Aug 17, 2017 at 10:38 AM, Gwenhael Pasquiers 
> <gwenhael.pasquiers@ericsson.com> wrote:
>> Hello,
>> We’re hitting a limit with the numberOfBuffers.
>> In a quite complex job we do a lot of operations, with a lot of 
>> operators, on a lot of folders (datehours).
>> In order to split the job into smaller “batches” (to limit the
>> necessary “numberOfBuffers”), I’ve added a loop over the batches
>> (handling the datehours 3 by 3); for each batch I create a new env
>> and then call the execute() method.
>> However it looks like there is no cleanup: after a while, if the
>> number of batches is too big, there is an error saying that the
>> numberOfBuffers isn’t high enough. It kind of looks like a leak.
>> Is there a way to clean them up?