crunch-dev mailing list archives

From "Mike Zimmerman (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-272) Unable to correlate crunch jobs within Oozie
Date Wed, 02 Oct 2013 20:13:42 GMT


Mike Zimmerman commented on CRUNCH-272:

Josh, I think that is a great first step.  Micah and I had a conversation offline about this
JIRA after I logged it, and we walked through my use cases in more detail.  My target users
are system administrators and developers who are trying to monitor and tune Oozie workflows
running on the Hadoop cluster.

The first part of the problem is figuring out a way to mark which jobs are involved in a
higher-level operation, such as a Crunch job launched through an Oozie workflow.  (Your
suggestion may help do this.)  The second and more difficult part of the problem is locating
these marked jobs after the parent process has completed.

My first thought is that it would be awesome if I could query the JobTracker by giving it a
correlation id and have it return a list of matching jobs.  I don't believe this is possible
today, and the idea is also somewhat flawed because all of the data would be lost if the
JobTracker instance were restarted.  My second thought is to harvest the information from log
data, but that seems like a lot of overhead and load on the cluster for something that should
be relatively simple.  My final thought is to write custom code that, while the Crunch job is
executing, logs this information to a store that can be queried later.

Any recommendations you have are very much appreciated.  I believe the solution to this
problem probably lies outside of the Crunch project, so if you need to close this issue,
please feel free to do so.
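
To make the last idea concrete, here is a minimal sketch of what such a store could look
like.  It is written in Python with the standard sqlite3 module purely for illustration and is
not tied to any Crunch, Oozie, or Hadoop API.  The table name, the function names, and the
choice of using the Oozie workflow id as the correlation id are all assumptions on my part.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the correlation store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_correlation ("
        " correlation_id TEXT NOT NULL,"
        " job_id TEXT NOT NULL,"
        " job_name TEXT,"
        " PRIMARY KEY (correlation_id, job_id))"
    )
    return conn


def record_job(conn, correlation_id, job_id, job_name=None):
    """Record one spawned job under its parent's correlation id.

    Intended to be called at submission time, while the parent is still running.
    """
    conn.execute(
        "INSERT OR REPLACE INTO job_correlation VALUES (?, ?, ?)",
        (correlation_id, job_id, job_name),
    )
    conn.commit()


def jobs_for(conn, correlation_id):
    """Return the ids of all jobs recorded under a correlation id, sorted."""
    rows = conn.execute(
        "SELECT job_id FROM job_correlation"
        " WHERE correlation_id = ? ORDER BY job_id",
        (correlation_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    store = open_store()
    # Made-up ids, for illustration only.
    record_job(store, "workflow-A", "job_0002", "crunch stage 2")
    record_job(store, "workflow-A", "job_0001", "crunch stage 1")
    print(jobs_for(store, "workflow-A"))
```

Because the records live outside the JobTracker, they would survive a JobTracker restart and
could still be queried after the parent process has completed.  Passing a file path instead
of the in-memory default would be needed for that to hold in practice.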

> Unable to correlate crunch jobs within Oozie
> --------------------------------------------
>                 Key: CRUNCH-272
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Mike Zimmerman
> I'm not really sure if this should be logged to Oozie or to Crunch, so please feel free
> to move as needed.
> I would like to request a way to decorate map/reduce jobs that are spawned by a Crunch
> pipeline so that I can programmatically determine their origin.  The primary use case for
> this is integration with Oozie.  Oozie launches a single map job to run a java action (in
> our case this java action runs a Crunch job).  Traceability from this original "launcher"
> job to the jobs created by the Crunch job is impossible without trawling logs.  This leaves
> a big black hole for a system operator trying to assess the performance/impact of these jobs.
> My initial thought was to provide a simple way to indicate a correlationId or similar on a
> map/reduce job and then make it accessible within Oozie to query for.  Obviously, that request
> would have to come after the correlation feature was available within map/reduce.

