hadoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Danil Suits <da...@vast.com>
Subject Interrupted YourKit Snapshots
Date Wed, 05 Aug 2015 17:24:20 GMT
Summary: why does the JVM for my (successful) map task exit before the
yourkit agent has written the snapshot?


I'm trying to use YourKit[1] to profile map reduce jobs running in yarn.
In a similar exercise running on MRv1, this was straight forward (
-agentpath:/opt/yjp/bin/linux-x86-64/libyjpagent.so=sampling,onexit=snapshot....
).

When I try the same thing in our MRv2 environment [2], the yourkit agent
logs suggest that the process is being interrupted before the snapshot can
be generated.

For example, in this case, an "empty" snapshot is created -- the snapshot
file is available, and readable, but does not contain any telemetry data.

15064 67.067: SP: pin java.lang.InterruptedException
15064 67.092: f_t_a_d: #56 trn=17 conv=0/1011 ms=1
15064 67.198: vmDeath: 1
15064 67.198: vmDeath: 2
15064 67.198: vmDeath: 3
15064 67.258: [m: 3927 1163] WSN: starting...
<eof>

But the problem is clearly a race, as I don't always see the WSN messages
start.  For instance:

15064 73.234: f_t_a_d: #50 trn=71 conv=0/691 ms=1
15064 74.620: SP: pin java.lang.InterruptedException
15064 74.647: f_t_a_d: #56 trn=17 conv=0/1352 ms=1
15064 74.752: vmDeath: 1
15064 74.752: vmDeath: 2
15064 74.752: vmDeath: 3
<eof>

In this case, there was no snapshot generated at all.

These two snippets both represent my attempts to run a trivial M/R job
hadoop --config ~/hadoop-performance jar ...[3]

The yourkit agent reports that it is running as

15064 0.000: Command line:
  /usr/lib/jvm/java-7-oracle-cloudera/bin/java
  -Djava.net.preferIPv4Stack=true
  -Dhadoop.metrics.log.level=WARN
  -Djava.net.preferIPv4Stack=true
  -Xmx3435973837

-Djava.io.tmpdir=/opt/yarn/nm/usercache/analytics/appcache/application_1426523383307_4394/container_e18_1426523383307_4394_01_000002/tmp
  -Dlog4j.configuration=container-log4j.properties

-Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1426523383307_4394/container_e18_1426523383307_4394_01_000002
  -Dyarn.app.container.log.filesize=0
  -Dhadoop.root.logger=INFO,CLA

-agentpath:/opt/yjp/bin/linux-x86-64/libyjpagent.so=sampling,onexit=snapshot,dir=/var/log/hadoop-yarn/yjp/pnq.1550/snapshot,logdir=/var/log/hadoop-yarn/yjp/pnq.1550/logDir
  org.apache.hadoop.mapred.YarnChild
  172.16.39.75
  37815
  attempt_1426523383307_4394_m_000000_0
  2

When we try to profile Spark applications, we find the snapshot as I
expected.  The yourkit log in that instance looks like


15064 22.900: vmDeath: 1
15064 22.900: vmDeath: 2
15064 22.900: vmDeath: 3
15064 23.019: [m: 8207 959] WSN: starting...
15064 23.680: [m: 8207 960] WSN: tables written
15064 23.680: [m: 8207 960] WSN: heap segment closed
15064 23.681: [m: 8207 960] WSN: map done
15064 23.726: SP: gc #0 28 ms=45 sz=36370 p=20 t=125
15064 23.726: [m: 8207 960] WSN: done
15064 23.727: Snapshot captured:
/var/log/hadoop-yarn/yjp/pnq.6666/CoarseGrainedExecutorBackend-2015-08-04-shutdown-3.snapshot
15064 23.728: vmDeath: 4
15064 23.728: vmDeath: OK

Having been run as the command:

15064 0.001: Command line:
  /usr/lib/jvm/java-7-oracle-cloudera/bin/java
  -server
  -XX:OnOutOfMemoryError=kill %p
  -Xms7192m
  -Xmx7192m

-agentpath:/opt/yjp/bin/linux-x86-64/libyjpagent.so=tracing,onexit=snapshot,dir=/var/log/hadoop-yarn/yjp/pnq.6666,logdir=/var/log/hadoop-yarn/yjp/logDir

-Djava.io.tmpdir=/opt/yarn/nm/usercache/rshah/appcache/application_1426523383307_4409/container_e18_1426523383307_4409_01_000002/tmp

-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1426523383307_4409/container_e18_1426523383307_4409_01_000002
  org.apache.spark.executor.CoarseGrainedExecutorBackend
  akka.tcp://
sparkDriver@beta-hadoop5.oak.vast.com:53957/user/CoarseGrainedScheduler
  4
  beta-hadoop1.oak.vast.com
  8
  application_1426523383307_4409


I've also verified, by connecting to the yourkit agent while the map task
is running, that the telemetry is available.

I did find this question in stackoverflow, which describes some parameters
that might help with the race during shutdown
http://stackoverflow.com/questions/25983999/hadoop-hprof-profiling-no-cpu-samples-written

However, using these parameters have not improved my results; in fact, I
can't detect any change to the outcome at all -- as far as I can tell, the
code path I'm on is not impacted by changing these settings.

Has anybody documented this investigation before me?  Pointers?
Suggestions?  Sympathy?  I'm not stuck yet, but I've definitely reached the
point in this investigation where daily progress is purely hypothetical.

Thanks,
Danil

[1] YourKit Java Profiler 2015 build 15064
[2] 2.5.0-cdh5.3.0 -- I don't have any reason at this point to assume this
is specific to this version.
[3] ~/hadoop-performance is my copy of our default configuration, with
minor(?) changes to specify properties to enable the profiling.

Mime
View raw message