1) In my dev cluster, there are 2 NodeManagers. The NodeManager process has been crashing often for the past few weeks because of memory issues, and as a temporary workaround I have increased its heap size (from 512 MB to 6 GB as of now). Two days ago, it could not even start with 4 GB after a crash; it started only after I increased the heap to 6 GB. I am using the free version of Cloudera Manager, and the attached graph (jvm_heap_size.png), taken from YARN monitoring, shows the heap usage across NodeManagers (metric name is jvm_heap_used_mb_across_nodemanagers).
I am also seeing a correlation with GC activity. The attached graph (jvm_gc_time_ms_rate_across_nodemanagers.png) shows this (metric name is jvm_gc_time_ms_rate_across_nodemanagers).
While analysing a heap dump of the killed NodeManager JVM, I found that a HashMap in DeletionService.java is taking a huge amount of memory for some reason.
2) The YARN JHS is showing only 20K entries (jobs), although YARN log aggregation is enabled and the retain seconds are configured as 30 days. However, I am able to see the logs of old jobs using the yarn CLI.
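For reference, here is a minimal sketch of the retention value I mean. It assumes the setting in question is yarn.log-aggregation.retain-seconds (the property name is an assumption; above I only wrote "retain seconds"), and the helper name is hypothetical.

```python
# Convert a retention period in days to the value in seconds that a
# "retain seconds" style setting expects.
# Assumption: the setting referred to above is
# yarn.log-aggregation.retain-seconds; days_to_seconds is a hypothetical helper.
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_seconds(days: int) -> int:
    """Return the number of seconds in the given number of whole days."""
    return days * SECONDS_PER_DAY


retain_seconds = days_to_seconds(30)
print(retain_seconds)  # 2592000
```

So a 30-day retention corresponds to a configured value of 2592000 seconds.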
I am using YARN 2.6.0. Can you help me with this?