crunch-dev mailing list archives

From Gabriel Reid <>
Subject Re: Visualizing internal pipeline preparation stages
Date Tue, 01 Jul 2014 12:28:09 GMT
Hey Christian,

This looks awesome! There have been a bunch of times when I've been
digging around in the planner and wanting to have something like this,
so yes, I definitely think this is useful to have.

- Gabriel

On Tue, Jul 1, 2014 at 2:16 PM, Christian Tzolov
<> wrote:
> Hi,
> While exploring the Crunch MR execution flow I decided to augment the
> excellent pipeline DOT diagram with a few additional visualizations of some
> interesting (for me) internal/intermediate pipeline preparation states,
> such as the output-pcollection-targets structure (used for the pipeline
> planning), the Graphs before and after the split-up of dependent GBK nodes,
> and the RTNode hierarchy as persisted in the Configuration before the
> execution of the pipeline.
> For each diagram I've plotted some relevant internals like the PType
> structures. The implementation hack includes 3 additional DotfileWriters
> hooked inside MSCRPlanner#plan() to intercept the flow.
> An example of the diagrams generated from the
> org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked below.
> Do we need such a visualization of the internals? Something like a
> visualization of the logical, mapping and physical (e.g. RTNodes) plans of
> the pipeline preparation? What do you think?
> Cheers,
> Christian
> Diagrams generated from the
> org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline.
> - Dotfile containing all graphs:
> 1.
> - is the existing diagram. It provides a very well-balanced view of the
> pipeline, showing how the functional blocks are mapped into execution
> Map/Reduce components and the dependencies between them.
> 2.
> - Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>> outputs) that
> the MSCRPlanner#plan() operation is executed on:
> - Each data flow is depicted with different color to indicate the
> overlapping execution paths.
> - The PCollection name, class and PTypes are shown.
> 3.
> - Visualizes the 'Base Graph' created in the MSCRPlanner#plan() method. It
> draws the Vertices with their names, pcollection and ptype. Each arc label
> lists the paths of the corresponding Graph edge.
> 4.
> - Graph created in MSCRPlanner#plan() after splitting up the dependent
> GBK nodes and breaking the graph up into connected components, each bounded
> by a red dashed line.
> 5.
> - Visualizes the RTNodes used inside the CrunchMapper and CrunchReducer as
> well as the Inputs and Outputs.
> - RTNodes are deserialized from the Job's
> the containing Map or Reduce tasks and parent Crunch Job. The relationship
> between RTNodes (e.g. parent/children) is depicted with arrows.
> - Named Outputs are deserialized from the CRUNCH_OUTPUTS into Map<String,
> OutputConfig> and depicted in the magenta subgraph
> - Inputs are deserialized from the CRUNCH_INPUTS into Map<FormatBundle,
> Map<Integer, List<Path>>> and depicted in green subgraph
> - The inputs are mapped to the corresponding RTNode using the nodeIndex
> reference.
> - Outputs are mapped to the corresponding RTNode by the Output Name
> references
> - There is no good way to print the anonymous DoFn instances.
> - Note: the dependency between the Crunch jobs is not drawn, as it may
> require access to the completion hook attributes.
> - Note: in order to draw the RTNodes I had to expose their attributes via
> public getters.
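
For anyone who hasn't written DOT by hand, the per-flow coloring described in
item 2 above could be sketched roughly as follows. This is only a minimal
sketch and not Christian's DotfileWriter code: the class name
OutputsDotSketch, the toDot() method and the color palette are all
hypothetical, and plain Strings stand in for the PCollectionImpl and Target
objects so that it runs without Crunch on the classpath.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a DotfileWriter; not part of Crunch.
public class OutputsDotSketch {

  // Assumed palette; colors repeat when there are more flows than entries.
  private static final String[] COLORS = {"red", "blue", "darkgreen", "orange", "purple"};

  // In a DOT quoted ID only the embedded double quote needs escaping.
  private static String quote(String s) {
    return "\"" + s.replace("\"", "\\\"") + "\"";
  }

  /** Renders a collection-name to target-names map as a DOT digraph, one color per flow. */
  public static String toDot(Map<String, Set<String>> outputs) {
    StringBuilder sb = new StringBuilder("digraph outputs {\n");
    int flow = 0;
    for (Map.Entry<String, Set<String>> e : outputs.entrySet()) {
      String color = COLORS[flow++ % COLORS.length];
      // One box per collection, drawn in the color of its flow.
      sb.append("  ").append(quote(e.getKey()))
        .append(" [shape=box, color=").append(color).append("];\n");
      // One edge per target; a target shared by several flows gets several colored edges.
      for (String target : e.getValue()) {
        sb.append("  ").append(quote(e.getKey())).append(" -> ").append(quote(target))
          .append(" [color=").append(color).append("];\n");
      }
    }
    return sb.append("}\n").toString();
  }

  public static void main(String[] args) {
    Map<String, Set<String>> outputs = new LinkedHashMap<>();
    outputs.put("pcollection-A", new LinkedHashSet<>(Arrays.asList("target-1", "target-2")));
    outputs.put("pcollection-B", new LinkedHashSet<>(Arrays.asList("target-2")));
    System.out.print(toDot(outputs));
  }
}
```

Running main() prints a digraph that can be fed to Graphviz's dot tool. A
target reached from more than one collection ends up with incoming edges of
different colors, which is what makes the overlapping paths visible.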
