hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12958) PhantomReference for filesystem statistics can trigger OOM
Date Thu, 24 Mar 2016 20:46:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210928#comment-15210928
] 

Jason Lowe commented on HADOOP-12958:
-------------------------------------

Thanks for chiming in, [~sjlee0].  I ran the job again with the requested JVM options, and
here's the last few lines in the GC log.  The task ran with JDK 1.8.0_60-b27 on RHEL 6.5:
{noformat}
2016-03-24T20:10:36.476+0000: 107.002: [GC (Allocation Failure) 2016-03-24T20:10:36.481+0000:
107.007: [SoftReference, 0 refs, 0.0000264 secs]2016-03-24T20:10:36.481+0000: 107.007: [WeakReference,
0 refs, 0.0000130 secs]2016-03-24T20:10:36.481+0000: 107.007: [FinalReference, 0 refs, 0.0000127
secs]2016-03-24T20:10:36.481+0000: 107.007: [PhantomReference, 0 refs, 0 refs, 0.0000140 secs]2016-03-24T20:10:36.481+0000:
107.007: [JNI Weak Reference, 0.0000170 secs][PSYoungGen: 115296K->96K(115712K)] 603566K->488366K(734720K),
0.0061637 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 
2016-03-24T20:10:36.960+0000: 107.486: [GC (Allocation Failure) 2016-03-24T20:10:36.965+0000:
107.491: [SoftReference, 0 refs, 0.0000472 secs]2016-03-24T20:10:36.965+0000: 107.491: [WeakReference,
0 refs, 0.0000363 secs]2016-03-24T20:10:36.965+0000: 107.491: [FinalReference, 0 refs, 0.0000108
secs]2016-03-24T20:10:36.965+0000: 107.491: [PhantomReference, 0 refs, 0 refs, 0.0000083 secs]2016-03-24T20:10:36.965+0000:
107.491: [JNI Weak Reference, 0.0000091 secs][PSYoungGen: 115296K->96K(115712K)] 603566K->488366K(734720K),
0.0056482 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2016-03-24T20:10:38.447+0000: 108.973: [GC (Allocation Failure) 2016-03-24T20:10:38.452+0000:
108.978: [SoftReference, 0 refs, 0.0000338 secs]2016-03-24T20:10:38.452+0000: 108.978: [WeakReference,
0 refs, 0.0000123 secs]2016-03-24T20:10:38.452+0000: 108.978: [FinalReference, 3 refs, 0.0000158
secs]2016-03-24T20:10:38.452+0000: 108.978: [PhantomReference, 0 refs, 0 refs, 0.0000113 secs]2016-03-24T20:10:38.452+0000:
108.978: [JNI Weak Reference, 0.0000148 secs][PSYoungGen: 115296K->256K(115712K)] 603566K->488526K(734720K),
0.0058352 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 
2016-03-24T20:10:40.917+0000: 111.443: [GC (Allocation Failure) 2016-03-24T20:10:40.922+0000:
111.448: [SoftReference, 0 refs, 0.0000209 secs]2016-03-24T20:10:40.922+0000: 111.448: [WeakReference,
159 refs, 0.0000212 secs]2016-03-24T20:10:40.922+0000: 111.448: [FinalReference, 269 refs,
0.0003309 secs]2016-03-24T20:10:40.923+0000: 111.449: [PhantomReference, 0 refs, 6 refs, 0.0000108
secs]2016-03-24T20:10:40.923+0000: 111.449: [JNI Weak Reference, 0.0000108 secs][PSYoungGen:
115456K->480K(115712K)] 603726K->489305K(734720K), 0.0061493 secs] [Times: user=0.02
sys=0.00, real=0.01 secs] 
2016-03-24T20:10:42.797+0000: 113.323: [GC (Allocation Failure) 2016-03-24T20:10:42.802+0000:
113.328: [SoftReference, 0 refs, 0.0000230 secs]2016-03-24T20:10:42.802+0000: 113.328: [WeakReference,
231 refs, 0.0000225 secs]2016-03-24T20:10:42.802+0000: 113.328: [FinalReference, 350 refs,
0.0004041 secs]2016-03-24T20:10:42.803+0000: 113.329: [PhantomReference, 0 refs, 6 refs, 0.0000097
secs]2016-03-24T20:10:42.803+0000: 113.329: [JNI Weak Reference, 0.0000106 secs][PSYoungGen:
115680K->480K(114688K)] 604505K->489537K(733696K), 0.0062425 secs] [Times: user=0.02
sys=0.00, real=0.00 secs] 
2016-03-24T20:10:44.626+0000: 115.151: [GC (Allocation Failure) 2016-03-24T20:10:44.630+0000:
115.156: [SoftReference, 0 refs, 0.0000247 secs]2016-03-24T20:10:44.630+0000: 115.156: [WeakReference,
230 refs, 0.0000234 secs]2016-03-24T20:10:44.630+0000: 115.156: [FinalReference, 350 refs,
0.0003741 secs]2016-03-24T20:10:44.631+0000: 115.157: [PhantomReference, 0 refs, 6 refs, 0.0000095
secs]2016-03-24T20:10:44.631+0000: 115.157: [JNI Weak Reference, 0.0000104 secs][PSYoungGen:
114656K->896K(115200K)] 603713K->489961K(734208K), 0.0059075 secs] [Times: user=0.02
sys=0.00, real=0.00 secs] 
2016-03-24T20:10:46.449+0000: 116.975: [GC (Allocation Failure) 2016-03-24T20:10:46.455+0000:
116.981: [SoftReference, 0 refs, 0.0000285 secs]2016-03-24T20:10:46.455+0000: 116.981: [WeakReference,
235 refs, 0.0000293 secs]2016-03-24T20:10:46.455+0000: 116.981: [FinalReference, 356 refs,
0.0004249 secs]2016-03-24T20:10:46.455+0000: 116.981: [PhantomReference, 0 refs, 7 refs, 0.0000106
secs]2016-03-24T20:10:46.455+0000: 116.981: [JNI Weak Reference, 0.0000105 secs][PSYoungGen:
115072K->864K(114176K)] 604137K->489929K(733184K), 0.0069392 secs] [Times: user=0.02
sys=0.00, real=0.01 secs] 
2016-03-24T20:10:48.741+0000: 119.267: [GC (Allocation Failure) 2016-03-24T20:10:48.749+0000:
119.275: [SoftReference, 0 refs, 0.0000592 secs]2016-03-24T20:10:48.749+0000: 119.275: [WeakReference,
639 refs, 0.0000830 secs]2016-03-24T20:10:48.749+0000: 119.275: [FinalReference, 177 refs,
0.0002006 secs]2016-03-24T20:10:48.749+0000: 119.275: [PhantomReference, 0 refs, 12 refs,
0.0000099 secs]2016-03-24T20:10:48.749+0000: 119.275: [JNI Weak Reference, 0.0000109 secs][PSYoungGen:
114016K->1514K(114688K)] 603081K->497627K(733696K), 0.0088279 secs] [Times: user=0.04
sys=0.00, real=0.01 secs] 
2016-03-24T20:10:48.837+0000: 119.362: [GC (Allocation Failure) 2016-03-24T20:10:48.844+0000:
119.369: [SoftReference, 0 refs, 0.0000405 secs]2016-03-24T20:10:48.844+0000: 119.370: [WeakReference,
281 refs, 0.0000290 secs]2016-03-24T20:10:48.844+0000: 119.370: [FinalReference, 3 refs, 0.0000159
secs]2016-03-24T20:10:48.844+0000: 119.370: [PhantomReference, 0 refs, 1 refs, 0.0000076 secs]2016-03-24T20:10:48.844+0000:
119.370: [JNI Weak Reference, 0.0000102 secs][PSYoungGen: 10455K->1530K(109568K)] 506569K->497707K(1041920K),
0.0077403 secs] [Times: user=0.04 sys=0.00, real=0.00 secs] 
2016-03-24T20:10:48.844+0000: 119.370: [GC (Allocation Failure) 2016-03-24T20:10:48.849+0000:
119.375: [SoftReference, 0 refs, 0.0000377 secs]2016-03-24T20:10:48.849+0000: 119.375: [WeakReference,
279 refs, 0.0000288 secs]2016-03-24T20:10:48.849+0000: 119.375: [FinalReference, 0 refs, 0.0000209
secs]2016-03-24T20:10:48.849+0000: 119.375: [PhantomReference, 0 refs, 1 refs, 0.0000113 secs]2016-03-24T20:10:48.849+0000:
119.375: [JNI Weak Reference, 0.0000132 secs][PSYoungGen: 1530K->1562K(112128K)] 497707K->497739K(1044480K),
0.0056076 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2016-03-24T20:10:48.850+0000: 119.376: [Full GC (Allocation Failure) 2016-03-24T20:10:48.857+0000:
119.383: [SoftReference, 0 refs, 0.0000476 secs]2016-03-24T20:10:48.857+0000: 119.383: [WeakReference,
1345 refs, 0.0000814 secs]2016-03-24T20:10:48.857+0000: 119.383: [FinalReference, 1006 refs,
0.0002261 secs]2016-03-24T20:10:48.857+0000: 119.383: [PhantomReference, 2 refs, 20 refs,
0.0000691 secs]2016-03-24T20:10:48.858+0000: 119.383: [JNI Weak Reference, 0.0000115 secs][PSYoungGen:
1562K->0K(112128K)] [ParOldGen: 496177K->490120K(627200K)] 497739K->490120K(739328K),
[Metaspace: 29185K->29185K(1075200K)], 0.1824223 secs] [Times: user=0.64 sys=0.00, real=0.18
secs] 
2016-03-24T20:10:49.033+0000: 119.559: [GC (Allocation Failure) 2016-03-24T20:10:49.038+0000:
119.564: [SoftReference, 0 refs, 0.0000332 secs]2016-03-24T20:10:49.038+0000: 119.564: [WeakReference,
0 refs, 0.0000123 secs]2016-03-24T20:10:49.038+0000: 119.564: [FinalReference, 0 refs, 0.0000103
secs]2016-03-24T20:10:49.038+0000: 119.564: [PhantomReference, 0 refs, 0 refs, 0.0000110 secs]2016-03-24T20:10:49.038+0000:
119.564: [JNI Weak Reference, 0.0000152 secs][PSYoungGen: 0K->0K(112128K)] 490120K->490120K(1044480K),
0.0057650 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2016-03-24T20:10:49.039+0000: 119.565: [Full GC (Allocation Failure) 2016-03-24T20:10:49.046+0000:
119.572: [SoftReference, 591 refs, 0.0001611 secs]2016-03-24T20:10:49.046+0000: 119.572: [WeakReference,
1218 refs, 0.0000570 secs]2016-03-24T20:10:49.046+0000: 119.572: [FinalReference, 691 refs,
0.0000371 secs]2016-03-24T20:10:49.046+0000: 119.572: [PhantomReference, 0 refs, 15 refs,
0.0000083 secs]2016-03-24T20:10:49.046+0000: 119.572: [JNI Weak Reference, 0.0000102 secs][PSYoungGen:
0K->0K(112128K)] [ParOldGen: 490120K->489358K(651776K)] 490120K->489358K(763904K),
[Metaspace: 29185K->29022K(1075200K)], 0.1861130 secs] [Times: user=0.64 sys=0.01, real=0.18
secs] 
Heap
 PSYoungGen      total 112128K, used 8888K [0x00000000f8e80000, 0x0000000100000000, 0x0000000100000000)
  eden space 108032K, 8% used [0x00000000f8e80000,0x00000000f972e390,0x00000000ff800000)
  from space 4096K, 0% used [0x00000000ffc00000,0x00000000ffc00000,0x0000000100000000)
  to   space 4096K, 0% used [0x00000000ff800000,0x00000000ff800000,0x00000000ffc00000)
 ParOldGen       total 932352K, used 489358K [0x00000000c0000000, 0x00000000f8e80000, 0x00000000f8e80000)
  object space 932352K, 52% used [0x00000000c0000000,0x00000000ddde3a80,0x00000000f8e80000)
 Metaspace       used 29118K, capacity 29346K, committed 29872K, reserved 1075200K
  class space    used 3320K, capacity 3374K, committed 3504K, reserved 1048576K
{noformat}

[~cmccabe]'s suggestion would probably fix the issue, since it hopefully triggers a GC cycle
before the big buffer comes along to force one.  Then we won't be in the middle of trying
to satisfy a huge allocation when we notice the phantom references are available for collection
for the first time and still need to do app-level processing to free it.  My main concern
with relying on that approach is tracking down other frameworks or apps that will need to
do similar things, since these kinds of jobs ran fine before.

> PhantomReference for filesystem statistics can trigger OOM
> ----------------------------------------------------------
>
>                 Key: HADOOP-12958
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12958
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.3, 2.6.4
>            Reporter: Jason Lowe
>             Fix For: 2.7.3, 2.6.5
>
>
> I saw an OOM that appears to have been caused by the phantom references introduced for
file system statistics management.  I'll post details in a followup comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message