crunch-dev mailing list archives

From Christian Tzolov <>
Subject Visualizing internal pipeline preparation stages
Date Tue, 01 Jul 2014 12:16:37 GMT

While exploring the Crunch MR execution flow I decided to augment the
excellent pipeline DOT diagram with a few additional visualizations of some
interesting (to me) internal/intermediate pipeline preparation states:
the output-pcollection-targets structure (used for pipeline planning), the
Graphs before and after the split-up of dependent GBK nodes, and the RTNode
hierarchy as persisted in the Configuration before the pipeline executes.
For each diagram I've plotted some relevant internals such as the PType
structures. The implementation hack adds 3 DotfileWriters
hooked inside MSCRPlanner#plan() to intercept the flow.
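
To give a feel for what such a writer emits, here is a minimal sketch of
rendering an outputs-to-targets map as DOT, with one color per data flow.
It is not the code from the hack: OutputsDotSketch, quote() and toDot() are
hypothetical names, plain strings stand in for PCollectionImpl and Target,
and the labels in main() are made up for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in, not part of Crunch: renders an outputs-to-targets map as DOT. */
class OutputsDotSketch {

    // Colors are cycled per output so that overlapping flows can be told apart.
    private static final String[] COLORS = {"blue", "red", "darkgreen", "orange", "purple"};

    /** Escapes backslashes and double quotes so the text is safe inside a DOT quoted string. */
    public static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /**
     * @param outputs maps a collection label (e.g. name plus PType) to its target labels;
     *                strings are an assumption standing in for PCollectionImpl and Target.
     */
    public static String toDot(Map<String, Set<String>> outputs) {
        StringBuilder sb = new StringBuilder("digraph outputs {\n");
        int i = 0;
        for (Map.Entry<String, Set<String>> e : outputs.entrySet()) {
            String color = COLORS[i++ % COLORS.length];
            // One box per collection, drawn in the color of its flow.
            sb.append("  ").append(quote(e.getKey()))
              .append(" [shape=box, color=").append(color).append("];\n");
            // One arrow per target, in the same color.
            for (String target : e.getValue()) {
                sb.append("  ").append(quote(e.getKey())).append(" -> ").append(quote(target))
                  .append(" [color=").append(color).append("];\n");
            }
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Illustrative labels only; a real writer would read them from the planner state.
        Map<String, Set<String>> outputs = new LinkedHashMap<>();
        outputs.put("words: PTable<String, Long>", Collections.singleton("target-1"));
        outputs.put("lines: PCollection<String>", Collections.singleton("target-2"));
        System.out.print(toDot(outputs));
    }
}
```

The printed text can be fed to Graphviz (for example `dot -Tpng`) to get
the picture.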

An example of the diagrams generated from the
org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked below.

Do we need this kind of internals visualization, i.e. something that shows
the logical, mapping and physical (e.g. RTNodes) plans of the pipeline
preparation? What do you think?


Diagrams generated from the
org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline.

- Dotfile containing all graphs:

- is the existing diagram. It provides a very well-balanced view of the
pipeline, showing how the functional blocks are mapped onto Map/Reduce
execution components and the dependencies between them.

- Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>> outputs)
that the MSCRPlanner works on when plan() is executed:
- Each data flow is drawn in a different color to show where execution
paths overlap.
- The PCollection name, class and PTypes are shown.

- Visualizes the 'Base Graph' created in the MSCRPlanner#plan() method. It
draws the Vertices with their names, pcollection and ptype. Each arc label
lists the paths of the corresponding Graph edge.

- Graph created in MSCRPlanner#plan() after the dependent GBK nodes are
split up and the graph is broken into connected components, each bounded
by a red dashed line.

- Visualizes the RTNodes used inside the CrunchMapper and CrunchReducer as
well as the Inputs and Outputs.
- RTNodes are deserialized from the Job's
the containing Map or Reduce tasks and parent Crunch Job. The relationship
between RTNodes (e.g. parent/children) is depicted with arrows.
- Named Outputs are deserialized from CRUNCH_OUTPUTS into Map<String,
OutputConfig> and depicted in the magenta subgraph.
- Inputs are deserialized from CRUNCH_INPUTS into Map<FormatBundle,
Map<Integer, List<Path>>> and depicted in the green subgraph.
- Inputs are mapped to the corresponding RTNode using the nodeIndex.
- Outputs are mapped to the corresponding RTNode by the Output Name.
- There is no good way to print the anonymous DoFn instances.
- Note: the dependency between the Crunch jobs is not drawn, as it may
require access to the completion hook attributes.
- Note: in order to draw the RTNodes I had to expose their attributes via
public getters.
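
As a rough illustration of the last diagram, the sketch below draws a small
node hierarchy with parent/child arrows, a green Inputs subgraph wired to
nodes by index, and a magenta Outputs subgraph wired by output name. It is
an assumption-laden stand-in, not the actual hack: RTNodeDotSketch and its
Node class are hypothetical, strings replace FormatBundle, Path and
OutputConfig, and the sample names in main() are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of Crunch: draws a node hierarchy with inputs and outputs. */
class RTNodeDotSketch {

    /** Stand-in for an RTNode: a name, an optional named output, and child nodes. */
    static final class Node {
        final String name;
        final String outputName; // null when the node writes no named output
        final List<Node> children = new ArrayList<>();

        Node(String name, String outputName, Node... children) {
            this.name = name;
            this.outputName = outputName;
            this.children.addAll(Arrays.asList(children));
        }
    }

    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Collects named outputs in depth-first order. */
    static void collectOutputs(Node n, List<String> acc) {
        if (n.outputName != null) acc.add(n.outputName);
        for (Node c : n.children) collectOutputs(c, acc);
    }

    /** Emits parent -> child arrows, then the arrow to the node's named output, if any. */
    static void writeEdges(Node n, StringBuilder sb) {
        for (Node c : n.children) {
            sb.append("  ").append(quote(n.name)).append(" -> ").append(quote(c.name)).append(";\n");
            writeEdges(c, sb);
        }
        if (n.outputName != null) {
            sb.append("  ").append(quote(n.name)).append(" -> ")
              .append(quote("out:" + n.outputName)).append(";\n");
        }
    }

    /** @param inputsByNodeIndex maps an index into roots to an input label (an assumption). */
    public static String toDot(Map<Integer, String> inputsByNodeIndex, List<Node> roots) {
        StringBuilder sb = new StringBuilder("digraph rtnodes {\n");
        sb.append("  subgraph cluster_inputs { label=\"Inputs\"; color=green;\n");
        for (String in : inputsByNodeIndex.values()) {
            sb.append("    ").append(quote("in:" + in)).append(";\n");
        }
        sb.append("  }\n");
        List<String> outputs = new ArrayList<>();
        for (Node r : roots) collectOutputs(r, outputs);
        sb.append("  subgraph cluster_outputs { label=\"Outputs\"; color=magenta;\n");
        for (String o : outputs) {
            sb.append("    ").append(quote("out:" + o)).append(";\n");
        }
        sb.append("  }\n");
        // Inputs are attached to the root node found at their index.
        for (Map.Entry<Integer, String> e : inputsByNodeIndex.entrySet()) {
            sb.append("  ").append(quote("in:" + e.getValue())).append(" -> ")
              .append(quote(roots.get(e.getKey()).name)).append(";\n");
        }
        for (Node r : roots) writeEdges(r, sb);
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        Node root = new Node("tokenize", null, new Node("count", "out0"));
        Map<Integer, String> inputs = new LinkedHashMap<>();
        inputs.put(0, "input-a");
        System.out.print(toDot(inputs, Collections.singletonList(root)));
    }
}
```

Prefixing the input and output labels ("in:", "out:") keeps them from
colliding with node names that happen to be identical.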
