crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Visualizing internal pipeline preparation stages
Date Tue, 01 Jul 2014 13:22:14 GMT
+1-- very cool. :)


On Tue, Jul 1, 2014 at 5:28 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Hey Christian,
>
> This looks awesome! There have been a bunch of times when I've been
> digging around in the planner and wanting to have something like this,
> so yes, I definitely think this is useful to have.
>
> - Gabriel
>
>
> On Tue, Jul 1, 2014 at 2:16 PM, Christian Tzolov
> <christian.tzolov@gmail.com> wrote:
> > Hi,
> >
> > While exploring the Crunch MR execution flow I decided to augment the
> > excellent pipeline DOT diagram with a few additional visualizations of some
> > interesting (for me) internal/intermediate pipeline preparation states,
> > such as the output-pcollection-targets structure (used for the pipeline
> > planning), the graphs before and after the split-up of dependent GBK nodes,
> > and the RTNode hierarchy as persisted in the Configuration before the
> > execution of the pipeline.
> > For each diagram I've plotted some relevant internals like the PType
> > structures. The implementation hack includes 3 additional DotfileWriters
> > hooked inside MSCRPlanner#plan() to intercept the flow.
> >
> > An example of the diagrams generated from the
> > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked below.
> >
> > Do we need this kind of internals visualization? Something like a
> > visualization of the logical, mapping and physical (e.g. RTNodes) plans of
> > the pipeline preparation? What do you think?
> >
> > Cheers,
> > Christian
> >
> >
> > Diagrams generated from the
> > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline.
> >
> > - Dotfile containing all graphs:
> > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint.dot
> >
> >
> > 1.
> > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_main.png
> > - is the existing diagram. It provides a very well balanced view of the
> > pipeline, showing how the functional blocks are mapped into execution
> > Map/Reduce components and the dependencies between them.
> >
> > 2.
> > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_pcollection_outputTargets.png
> > - Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>> outputs) in
> > the MSCRPlanner while the plan() operation is executing:
> > - Each data flow is depicted with a different color to indicate the
> > overlapping execution paths.
> > - The PCollection name, class and PTypes are shown.
> >
> > 3.
> > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_BaseGraph.png
> > - Visualizes the 'Base Graph' created in the MSCRPlanner#plan() method. It
> > draws the Vertices with their names, pcollection and ptype. Each arc label
> > lists the paths of the corresponding Graph edge.
> >
> > 4.
> > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_FinalGraphWithComponents.png
> > - Graph created in MSCRPlanner#plan() after splitting up dependent GBK
> > nodes and breaking the graph up into connected components, each bounded by
> > a red dashed line.
> >
> > 5.
> > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_RTNodesAndFormatBundles.png
> > - Visualizes the RTNodes used inside the CrunchMapper and CrunchReducer as
> > well as the Inputs and Outputs.
> > - RTNodes are deserialized from the Job's
> > CRUNCH_WORKING_DIRECTORY/(MAP|REDUCE|COMBINE). Every RTNode is mapped to
> > the containing Map or Reduce task and parent Crunch Job. The relationship
> > between RTNodes (e.g. parent/children) is depicted with arrows.
> > - Named Outputs are deserialized from the CRUNCH_OUTPUTS into Map<String,
> > OutputConfig> and depicted in the magenta subgraph.
> > - Inputs are deserialized from the CRUNCH_INPUTS into Map<FormatBundle,
> > Map<Integer, List<Path>>> and depicted in the green subgraph.
> > - The inputs are mapped to the corresponding RTNode using the nodeIndex
> > reference.
> > - Outputs are mapped to the corresponding RTNode by the Output Name
> > reference.
> > - There is no good way to print the anonymous DoFn instances.
> > - Note: the dependency between the Crunch jobs is not drawn, as it may
> > require access to the completion hook attributes.
> > - Note: in order to draw the RTNodes I had to expose their attributes via
> > public getters.
>
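
For readers who want a feel for what diagram 2 in the quoted message
involves, here is a minimal sketch of turning an outputs-to-targets map
into DOT with one color per data flow. It is not the DotfileWriter code
from the thread: the class name OutputTargetsDot, the toDot method and the
color palette are all hypothetical, and plain strings stand in for
PCollectionImpl and Target so the example runs without Crunch on the
classpath.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Crunch.
final class OutputTargetsDot {
    // Assumed palette: one color per data flow, reused cyclically.
    private static final List<String> COLORS =
            List.of("red", "blue", "darkgreen", "orange", "purple");

    // Escape double quotes so names are safe inside DOT string literals.
    private static String esc(String s) {
        return s.replace("\"", "\\\"");
    }

    // Renders a (collection name -> target names) map as a DOT digraph.
    // Strings stand in for PCollectionImpl<?> and Target here.
    static String toDot(Map<String, Set<String>> outputs) {
        StringBuilder sb = new StringBuilder("digraph outputs {\n");
        int flow = 0;
        for (Map.Entry<String, Set<String>> e : outputs.entrySet()) {
            String color = COLORS.get(flow++ % COLORS.size());
            String pc = esc(e.getKey());
            // One box node per collection, colored by its data flow.
            sb.append(String.format("  \"%s\" [shape=box, color=%s];\n", pc, color));
            // One edge per target, in the same color as its collection.
            for (String target : e.getValue()) {
                sb.append(String.format("  \"%s\" -> \"%s\" [color=%s];\n",
                        pc, esc(target), color));
            }
        }
        return sb.append("}\n").toString();
    }
}
```

Passing the returned string to Graphviz (for example by saving it to a
.dot file and running the dot tool) should give a picture where a target
shared by several collections receives edges in several colors, which is
roughly the "overlapping execution paths" effect described above.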



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
