crunch-dev mailing list archives

From Christian Tzolov <christian.tzo...@gmail.com>
Subject Re: Visualizing internal pipeline preparation stages
Date Wed, 02 Jul 2014 17:21:21 GMT
Cool :) What is the best way to continue? Should I open a new Jira ticket for it?


On Tue, Jul 1, 2014 at 3:22 PM, Josh Wills <jwills@cloudera.com> wrote:

> +1-- very cool. :)
>
>
> On Tue, Jul 1, 2014 at 5:28 AM, Gabriel Reid <gabriel.reid@gmail.com>
> wrote:
>
> > Hey Christian,
> >
> > This looks awesome! There have been a bunch of times when I've been
> > digging around in the planner and wanting to have something like this,
> > so yes, I definitely think this is useful to have.
> >
> > - Gabriel
> >
> >
> > On Tue, Jul 1, 2014 at 2:16 PM, Christian Tzolov
> > <christian.tzolov@gmail.com> wrote:
> > > Hi,
> > >
> > > While exploring the Crunch MR execution flow I decided to augment the
> > > excellent pipeline DOT diagram with a few additional visualizations of
> > > some interesting (for me) internal/intermediate pipeline preparation
> > > states, such as the output-pcollection-targets structure (used for the
> > > pipeline planning), the graphs before and after the split-up of
> > > dependent GBK nodes, and the RTNode hierarchy as persisted in the
> > > Configuration before the execution of the pipeline.
> > > For each diagram I've plotted some relevant internals like the PType
> > > structures. The implementation hack includes 3 additional
> > > DotfileWriters hooked inside MSCRPlanner#plan() to intercept the flow.
> > >
> > > An example of the diagrams generated from the
> > > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked
> > > below.
> > >
> > > Do we need this kind of visualization of the internals? Something like
> > > a visualization of the logical, mapping and physical (e.g. RTNodes)
> > > plans of the pipeline preparation? What do you think?
> > >
> > > Cheers,
> > > Christian
> > >
> > >
> > > Diagrams generated from the
> > > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline.
> > >
> > > - Dotfile containing all graphs:
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint.dot
> > >
> > >
> > > 1.
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_main.png
> > > - is the existing diagram. It provides a very well balanced view of the
> > > pipeline, showing how the functional blocks are mapped into execution
> > > Map/Reduce components and the dependencies between them.
> > >
> > > 2.
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_pcollection_outputTargets.png
> > > - Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>> outputs)
> > > in the MSCRPlanner when the plan() operation is executed:
> > > - Each data flow is depicted with a different color to indicate the
> > > overlapping execution paths.
> > > - The PCollection name, class and PTypes are shown.
> > >
> > > 3.
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_BaseGraph.png
> > > - Visualizes the 'Base Graph' created in the MSCRPlanner#plan() method.
> > > It draws the vertices with their names, PCollection and PType. Each arc
> > > label lists the paths of the corresponding graph edge.
> > >
> > > 4.
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_FinalGraphWithComponents.png
> > > - The graph created in MSCRPlanner#plan() after splitting up the
> > > dependent GBK nodes and breaking the graph up into connected
> > > components, each bounded by a red dashed line.
> > >
> > > 5.
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_RTNodesAndFormatBundles.png
> > > - Visualizes the RTNodes used inside the CrunchMapper and CrunchReducer
> > > as well as the Inputs and Outputs.
> > > - RTNodes are deserialized from the Job's
> > > CRUNCH_WORKING_DIRECTORY/(MAP|REDUCE|COMBINE). Every RTNode is mapped
> > > to the containing Map or Reduce task and the parent Crunch Job. The
> > > relationship between RTNodes (e.g. parent/children) is depicted with
> > > arrows.
> > > - Named Outputs are deserialized from CRUNCH_OUTPUTS into
> > > Map<String, OutputConfig> and depicted in the magenta subgraph.
> > > - Inputs are deserialized from CRUNCH_INPUTS into Map<FormatBundle,
> > > Map<Integer, List<Path>>> and depicted in the green subgraph.
> > > - The inputs are mapped to the corresponding RTNode using the nodeIndex
> > > reference.
> > > - Outputs are mapped to the corresponding RTNode by the Output Name
> > > references.
> > > - There is no good way to print the anonymous DoFn instances.
> > > - Note: the dependency between the Crunch jobs is not drawn as it may
> > > require access to the completion hook attributes.
> > > - Note: in order to draw the RTNodes I had to expose their attributes
> > > via public getters.
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
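
As a rough illustration of the idea behind diagram 2 in the quoted message (one
color per data flow from a collection to its targets), here is a minimal Java
sketch. It is not the code from the proposed change and it does not use Crunch's
DotfileWriter. The class OutputTargetsDot, the method toDot, and the plain String
keys standing in for PCollectionImpl and Target are all assumptions made for the
example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Crunch: renders a "collection -> targets"
// map as a DOT digraph, giving the edges of each collection their own color.
class OutputTargetsDot {

    // Standard Graphviz color names, reused cyclically if there are more flows.
    private static final String[] COLORS = {"red", "blue", "darkgreen", "orange", "purple"};

    // Escape double quotes so that names can be used as quoted DOT identifiers.
    private static String quote(String name) {
        return "\"" + name.replace("\"", "\\\"") + "\"";
    }

    static String toDot(Map<String, Set<String>> outputs) {
        StringBuilder sb = new StringBuilder("digraph outputs {\n");
        int flow = 0;
        for (Map.Entry<String, Set<String>> entry : outputs.entrySet()) {
            // One color per collection, so overlapping paths stay distinguishable.
            String color = COLORS[flow++ % COLORS.length];
            for (String target : entry.getValue()) {
                sb.append("  ").append(quote(entry.getKey()))
                  .append(" -> ").append(quote(target))
                  .append(" [color=").append(color).append("];\n");
            }
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Made-up collection and target names, for illustration only.
        Map<String, Set<String>> outputs = new LinkedHashMap<>();
        outputs.put("words", new LinkedHashSet<>(Arrays.asList("out/words", "out/all")));
        outputs.put("counts", new LinkedHashSet<>(Arrays.asList("out/counts")));
        System.out.print(toDot(outputs));
    }
}
```

The printed text can be saved to a file and rendered with Graphviz, for example
with "dot -Tpng outputs.dot -o outputs.png". Insertion-ordered maps are used so
that the color assignment is stable between runs.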
