flink-user mailing list archives

From Nico Kruber <n...@data-artisans.com>
Subject Re: Great number of jobs and numberOfBuffers
Date Fri, 18 Aug 2017 12:58:13 GMT
Hi Gwenhael,
The effect you describe sounds a bit strange. Just to clarify your setup:

1) Is the loop you posted part of the application you run on YARN?
2) How many nodes are you running with?
3) What error did you get when you tried to run the full program without
splitting it?
4) Can you give a rough sketch of what your program is composed of
(operators, parallelism, ...)?


Nico

On Thursday, 17 August 2017 11:53:25 CEST Gwenhael Pasquiers wrote:
> Hello,
> 
> This bug was met in Flink 1.0.1 over YARN (maybe the YARN behavior is
> different?). We've been having this issue for a long time and we were
> careful not to schedule too many jobs.
 
> I'm currently upgrading the application towards Flink 1.2.1 and I'd like to
> try to solve this issue.
 
> I'm not submitting individual jobs to a standalone cluster.
> 
> I'm starting a single application that has a loop in its main function:
>
> for (...) {
>     ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>     env...;
>     env.execute();
> }
> 
> 
> The job fails at some point later during execution with the following
> error:
>
> java.io.IOException: Insufficient number of network buffers:
> required 96, but only 35 available. The total number of network buffers is
> currently set to 36864. You can increase this number by setting the
> configuration key 'taskmanager.network.numberOfBuffers'.
> 	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:196)
>
> Before splitting the job into multiple sub-jobs it failed right at startup.
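For reference, the key named in that error is a cluster-wide setting in flink-conf.yaml; the entry below simply mirrors the total reported above, and the memory figure in the comment assumes Flink's default 32 KiB network buffer size:

```yaml
# flink-conf.yaml -- total network buffers available per TaskManager process.
# 36864 buffers * 32 KiB (default segment size) ≈ 1.1 GiB of network memory.
taskmanager.network.numberOfBuffers: 36864
```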
> 
> Each "batch" job takes 10 to 30 minutes and it fails after about a dozen of
> them (the first ones should have had enough time to be recycled).
 
> We've already increased the jobmanager and "numberOfBuffers" values quite a
> bit. That way we can handle days of data, but not weeks or months; this is
> not very scalable. And as you say, I felt that those buffers should be
> recycled, in which case we should have no limit as long as each batch is
> small enough.
 
> If I start my command again (removing the datehours that were successfully
> processed) it will work, since it's then a fresh cluster.
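As a side note on sizing this key: the Flink documentation's rule of thumb is roughly #slots-per-TaskManager² × #TaskManagers × 4. A small sketch of that arithmetic (the cluster numbers below are made up for illustration, not Gwenhael's actual topology):

```java
// Rule of thumb from the Flink configuration docs:
//   numberOfBuffers ≈ slots-per-TM^2 * #TMs * 4
// This only accounts for channels between repartitioning/broadcasting steps;
// jobs with many shuffles may need more.
public class NetworkBufferSizing {

    static int recommendedBuffers(int slotsPerTaskManager, int numTaskManagers) {
        return slotsPerTaskManager * slotsPerTaskManager * numTaskManagers * 4;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 8 slots per TaskManager, 4 TaskManagers.
        System.out.println(recommendedBuffers(8, 4)); // 1024
    }
}
```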
 
> -----Original Message-----
> From: Ufuk Celebi [mailto:uce@apache.org]
> Sent: Thursday, 17 August 2017 11:24
> To: Ufuk Celebi <uce@apache.org>
> Cc: Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com>;
> user@flink.apache.org; Nico Kruber <nico@data-artisans.com>
> Subject: Re: Great number of jobs and numberOfBuffers
> 
> PS: Also pulling in Nico (CC'd) who is working on the network stack.
> 
> On Thu, Aug 17, 2017 at 11:23 AM, Ufuk Celebi <uce@apache.org> wrote:
> 
> > Hey Gwenhael,
> >
> > the network buffers are recycled automatically after a job terminates.
> > If this does not happen, it would be quite a major bug.
> >
> > To help debug this:
> >
> > - Which version of Flink are you using?
> > - Does the job fail immediately after submission or later during
> > execution?
> > - Is the following correct: the batch job that eventually fails
> > because of missing network buffers runs without problems if you
> > submit it to a fresh cluster with the same memory?
> >
> > The network buffers are recycled after the task managers report the
> > task as finished. If you immediately submit the next batch, there is
> > a slight chance that the buffers have not been recycled yet. As a
> > possible temporary workaround, could you try waiting for a short
> > amount of time before submitting the next batch?
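In the shape of Gwenhael's loop, that workaround might look like the sketch below. This is only an illustration: runBatch() is a hypothetical stand-in for creating the ExecutionEnvironment, building the dataflow for one datehour batch, and calling env.execute(), and the pause length is a guess, not a value from Flink.

```java
import java.util.Arrays;
import java.util.List;

// Illustration of the "wait between batches" idea from this thread.
public class BatchedSubmission {

    // Hypothetical stand-in for: get the ExecutionEnvironment, wire up the
    // operators for one datehour batch, then call env.execute().
    static void runBatch(String batch) {
        System.out.println("executed batch " + batch);
    }

    // Runs each batch, pausing between submissions so the TaskManagers can
    // report FINISHED and recycle their network buffers.
    // Returns the number of batches executed.
    static int runAll(List<String> batches, long pauseMillis) {
        int executed = 0;
        for (String batch : batches) {
            runBatch(batch);
            executed++;
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        // A real job might pause for tens of seconds between batches;
        // 100 ms keeps this demo quick.
        int n = runAll(Arrays.asList("2017081000", "2017081001"), 100);
        System.out.println(n + " batches done");
    }
}
```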
> >
> > I think we should also be able to run the job without splitting it up 
> > after increasing the network memory configuration. Did you already try 
> > this?
> >
> > Best,
> >
> > Ufuk
> >
> > On Thu, Aug 17, 2017 at 10:38 AM, Gwenhael Pasquiers 
> > <gwenhael.pasquiers@ericsson.com> wrote:
> > 
> >> Hello,
> >>
> >> We’re meeting a limit with the numberOfBuffers.
> >>
> >> In quite a complex job we do a lot of operations, with a lot of
> >> operators, on a lot of folders (datehours).
> >>
> >> In order to split the job into smaller “batches” (to limit the
> >> necessary “numberOfBuffers”) I’ve done a loop over the batches (handling
> >> the datehours 3 by 3); for each batch I create a new env and then call
> >> the execute() method.
> >>
> >> However it looks like there is no cleanup: after a while, if the
> >> number of batches is too big, there is an error saying that the
> >> numberOfBuffers isn’t high enough. It kind of looks like a leak.
> >> Is there a way to clean them up?

