crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Updated] (CRUNCH-272) Unable to correlate crunch jobs within Oozie
Date Fri, 27 Jun 2014 22:01:26 GMT


Micah Whitacre updated CRUNCH-272:

    Attachment: CRUNCH-272.patch

Here is a patch that extends the prototype and also includes a custom CrunchActionExecutor
plus a schema for configuring Crunch in a workflow.  The schema is pretty much a copy of the
Java Action schema.

This is probably still a prototype rather than something that will be merged immediately.

Some things to note with the patch:

* Oozie does not deploy endstates to a Maven repository (,
which means if we want to depend on Oozie we need to pick a distribution (I went with Cloudera's
only because that is what I had access to right now).  If we didn't own the CrunchActionExecutor
then we wouldn't need that dependency.
* There isn't a formal contract between the CrunchActionExecutor and what actually launches
the Crunch Pipeline(s).  Launchers can either call the convenience methods to do the reporting
or extend the CrunchOozieLauncher.  This doesn't exactly align with most Oozie executors,
which have a more formal contract, it seems.  Do we want to start controlling how pipelines
are launched?
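
To make the correlation idea from the issue concrete, here is a minimal sketch of how a launcher could stamp each spawned job's configuration with the id of the Oozie action that started it. This is not code from the patch: CorrelationTagger, the crunch.correlation.id key, and the assumption that the launcher can read the action id from a property named oozie.action.id are all hypothetical, and plain maps stand in for the Hadoop job configuration so the example has no dependencies.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Crunch, Oozie, or the attached patch. */
final class CorrelationTagger {
    // Assumed name of the property through which the launcher learns its Oozie action id.
    static final String LAUNCHER_ID_KEY = "oozie.action.id";
    // Made-up key under which the id is recorded on each spawned job.
    static final String CORRELATION_KEY = "crunch.correlation.id";

    private CorrelationTagger() {}

    /**
     * Returns a copy of jobConf with the launcher's id stored under CORRELATION_KEY.
     * If the launcher properties carry no id, the copy is returned without a tag.
     */
    static Map<String, String> tag(Map<String, String> launcherProps,
                                   Map<String, String> jobConf) {
        Map<String, String> tagged = new HashMap<>(jobConf);
        String id = launcherProps.get(LAUNCHER_ID_KEY);
        if (id != null && !id.isEmpty()) {
            tagged.put(CORRELATION_KEY, id);
        }
        return tagged;
    }

    public static void main(String[] args) {
        Map<String, String> launcher = new HashMap<>();
        launcher.put(LAUNCHER_ID_KEY, "example-action-id");

        Map<String, String> job = new HashMap<>();
        job.put("mapreduce.job.name", "example-crunch-stage");

        // Every job the pipeline spawns would be tagged the same way before submission.
        System.out.println(tag(launcher, job));
    }
}
```

With something like this in place, an operator could look up every job carrying a given correlation id instead of searching the launcher logs. Whether the tagging lives in convenience methods or in a base launcher class is exactly the contract question raised above.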

[~rkanter], this is obviously my first stab at a Crunch Action in Oozie.  If you have any
suggestions or alternate routes to go with this, I'd be interested in hearing them.

> Unable to correlate crunch jobs within Oozie
> --------------------------------------------
>                 Key: CRUNCH-272
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Mike Zimmerman
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-272.patch, CRUNCH-272_prototype.patch
> I'm not really sure if this should be logged to Oozie or to Crunch, so please feel free
> to move as needed.
> I would like to request a way to decorate map/reduce jobs that are spawned by a Crunch
> pipeline so that I can programmatically determine their origin.  The primary use case for
> this is integration with Oozie.  Oozie launches a single map job to run a java action (in
> our case this java action runs a crunch job).  Traceability from this original "launcher"
> job to the jobs created by the crunch job is impossible without trolling logs.  This leaves
> a big black hole for the system operator to assess the performance/impact of these jobs.
> My initial thought was to provide a simple way to indicate a correlationId or similar on a
> map/reduce job and then make it accessible within Oozie to query for.  Obviously, that request
> would have to come after the correlation feature was available within map/reduce.

This message was sent by Atlassian JIRA
