crunch-user mailing list archives

From Jeff Quinn <>
Subject Re: Crunch API to run code at JVM startup / shutdown
Date Thu, 20 Aug 2015 18:19:31 GMT
We have only just started profiling our Crunch pipeline using the built-in
-Xprof option, and it's been pretty hard to interpret. A more robust
profiling option built into Crunch that could tell us some DoFn-granular
statistics would definitely be useful. In general, I'm very curious to hear
what Crunch users do about profiling (specifically using the MRPipeline).

On Thu, Aug 20, 2015 at 3:34 AM, Clément MATHIEU <> wrote:

> Hi,
> I am trying to set up something to automatically profile my Crunch jobs on
> a Hadoop cluster.
> I have been a long-time user of hprof & "mapred.task.profile" because it
> is so easy to use on Hadoop. However, I am now moving away from it:
>  - it will be removed in Java 9
>  - it suffers from safepoint bias
>  - it cannot profile native code
>  - gathering metrics other than stack trace samples can be useful
> I would like to replace hprof with Flight Recorder and/or perf. Unlike hprof,
> both need to be started and stopped programmatically since there is no
> glue for them in Hadoop. I can see three options:
> 1. Hack the app
> It can be done using DoFn.initialize/cleanup. Either all DoFns invoke the
> same idempotent code, or dedicated DoFns are inserted at specific points.
> Both seem horrific and disgusting :)
> 2. Java agent
> Profiling is not tied to Crunch, so any tool can be profiled. The main
> drawbacks are that the agent must be deployed on all the nodes and that it
> does not have easy access to metadata like user, job name, stage etc.
> A good example of such an agent is statsd-jvm-profiler, see
> They even have a small
> bridge to push Cascading metadata to the agent, see
> .
> 3. Dedicated Crunch API
> Some code needs to be executed on JVM startup / shutdown. AFAIK it is not
> currently possible but could be added (however I am not sure how to
> implement it on Spark). Unlike a javaagent, it does not require deploying
> anything on the nodes, metadata can be pushed to the services (i.e. ctx),
> and it is more flexible.
> I believe that allowing users to easily run code at JVM startup / shutdown
> would be a useful improvement. Any opinions?
> Clément MATHIEU
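
For reference, the "idempotent code" variant of option 1 in the quoted message could look roughly like the sketch below. It is only an illustration under assumptions: the class name JvmLifecycle and the method ensureStarted are hypothetical and not part of Crunch or Hadoop, and the start/stop actions (for example, starting and stopping a profiler) are passed in by the caller. It uses only the JDK: an AtomicBoolean guards the one-time start, and a standard JVM shutdown hook runs the stop action.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical helper (not a Crunch or Hadoop API): runs a start action at
 * most once per JVM and registers a stop action as a JVM shutdown hook.
 */
final class JvmLifecycle {
    // One flag per JVM (more precisely, per classloader that loads this class).
    private static final AtomicBoolean STARTED = new AtomicBoolean(false);

    private JvmLifecycle() {}

    /**
     * Safe to call from many places and many threads; only the first call
     * runs onStart and registers onStop.
     *
     * @return true if this call performed the start, false otherwise
     */
    static boolean ensureStarted(Runnable onStart, Runnable onStop) {
        // compareAndSet lets exactly one caller through.
        if (!STARTED.compareAndSet(false, true)) {
            return false;
        }
        // Note: if onStart throws, the flag stays set and no retry happens.
        onStart.run();
        // The hook runs on normal JVM exit, not if the process is killed
        // with SIGKILL or stopped via Runtime.halt().
        Runtime.getRuntime().addShutdownHook(new Thread(onStop, "jvm-lifecycle-stop"));
        return true;
    }
}
```

Each DoFn would then call something like JvmLifecycle.ensureStarted(startProfiler, stopProfiler) from its initialize() method, where startProfiler and stopProfiler are placeholder Runnables. Whether the shutdown hook gets a chance to run depends on how the task JVM is terminated, so this is a starting point rather than a guaranteed replacement for a dedicated API.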
