flink-user mailing list archives

From Greg Hogan <c...@greghogan.com>
Subject Re: Task Manager was lost/killed due to full GC
Date Fri, 15 Sep 2017 16:50:54 GMT
Late response, but a common cause of disappearing TaskManagers is termination by the Linux
out-of-memory (OOM) killer. The usual recommendation in that case is to decrease the allotted memory.
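
One way to check whether the OOM killer was involved is to search the kernel log on the affected host for its "Killed process" messages. The following is a minimal sketch rather than a definitive tool: the helper name `find_oom_kills` is made up for illustration, and the regular expression assumes the usual kernel wording ("Killed process <pid> (<name>)"), which can differ between kernel versions.

```python
import re

# Matches the kernel's report of the victim, e.g.
#   "Killed process 4242 (java) total-vm:...kB, anon-rss:...kB, file-rss:...kB"
# The exact wording is an assumption and may vary by kernel version.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_lines):
    """Return (pid, process_name) for every OOM-killer victim found in the log."""
    kills = []
    for line in kernel_log_lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

The function can be fed the lines printed by `dmesg` or read from the system's kernel log file. A `java` entry whose PID matches the lost TaskManager would point to the OOM killer rather than to garbage collection.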


> On Sep 5, 2017, at 9:09 AM, ShB <shon.balakrishna@gmail.com> wrote:
> 
> Hi, 
> 
> I'm running a Flink batch job that reads almost 1 TB of data from S3 and
> then performs operations on it. A list of filenames is distributed among
> the TMs, and each TM reads its subset of files from S3. This job
> fails at the read step with the following error:
> java.lang.Exception: TaskManager was lost/killed
> 
> Having read similar questions on the mailing list, it seems like this is a
> memory issue, with full GC at the TM causing the TM to be lost. 
> 
> After enabling memory debugging, these seem to be the stats just before
> the job errors out:
> Memory usage stats: [HEAP: 8327/18704/18704 MB, NON HEAP: 79/81/-1 MB
> (used/committed/max)]
> Direct memory stats: Count: 5236, Total Capacity: 17148907, Used Memory:
> 17148908
> Off-heap pool stats: [Code Cache: 25/27/240 MB (used/committed/max)],
> [Metaspace: 47/48/-1 MB (used/committed/max)], [Compressed Class Space:
> 5/5/1024 MB (used/committed/max)]
> Garbage collector stats: [G1 Young Generation, GC TIME (ms): 16712, GC
> COUNT: 290], [G1 Old Generation, GC TIME (ms): 689, GC COUNT: 2]
> 
> I tried all of these suggested fixes: decreased taskmanager.memory.fraction
> to give more memory to user-managed operations, increased the number of
> JVMs (parallelism), and used the G1 GC for better GC performance, but my job
> still errors out.
> 
> I increased akka.watch.heartbeat.pause, akka.watch.threshold,
> akka.watch.heartbeat.interval to prevent the timeout due to GC. But this
> doesn't help either. I figured with the really high values for death watch,
> the program would run really slowly and complete at some point but it fails
> anyway. 
> 
> I'm now trying to decrease object creation in my program, but so far it
> hasn't helped.
> 
> How can I go about debugging and fixing this problem?
> 
> Thank you. 
> 
> 
> 
> 
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
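
The memory and GC statistics quoted above can be turned into a couple of simple ratios to see how much pressure the heap was under. This is a rough sketch under stated assumptions: the two log entries are rejoined into single lines (they are wrapped in the email), the regular expressions are written against the format shown in the quoted message only, and the helper names are hypothetical.

```python
import re

# Patterns written against the log format quoted in the message above.
HEAP = re.compile(r"\[HEAP: (\d+)/(\d+)/(\d+) MB")
GC = re.compile(r"\[([^,\[]+), GC TIME \(ms\): (\d+), GC COUNT: (\d+)\]")


def heap_used_fraction(stats_line):
    """Return used heap divided by max heap from a 'Memory usage stats' line."""
    match = HEAP.search(stats_line)
    if match is None:
        raise ValueError("no HEAP entry found")
    used, _committed, maximum = map(int, match.groups())
    return used / maximum


def mean_gc_pause_ms(stats_line):
    """Return the average time per collection (ms) for each collector listed."""
    return {
        name: int(total_ms) / int(count)
        for name, total_ms, count in GC.findall(stats_line)
        if int(count) > 0
    }


heap_line = ("Memory usage stats: [HEAP: 8327/18704/18704 MB, "
             "NON HEAP: 79/81/-1 MB (used/committed/max)]")
gc_line = ("Garbage collector stats: [G1 Young Generation, GC TIME (ms): 16712, "
           "GC COUNT: 290], [G1 Old Generation, GC TIME (ms): 689, GC COUNT: 2]")

print(f"heap used: {heap_used_fraction(heap_line):.1%}")
for collector, pause in mean_gc_pause_ms(gc_line).items():
    print(f"{collector}: {pause:.1f} ms per collection")
```

With the quoted numbers, the heap is less than half full and only two old-generation collections were recorded. That would be consistent with a cause outside the JVM, such as the OOM killer, although a single snapshot is not conclusive.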

