crunch-dev mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Visualizing internal pipeline preparation stages
Date Wed, 02 Jul 2014 17:22:13 GMT
Yes please.
On Jul 2, 2014 10:21 AM, "Christian Tzolov" <christian.tzolov@gmail.com>
wrote:

> Cool :) What is the best way to continue? Open a new Jira ticket for it?
>
>
> On Tue, Jul 1, 2014 at 3:22 PM, Josh Wills <jwills@cloudera.com> wrote:
>
> > +1-- very cool. :)
> >
> >
> > On Tue, Jul 1, 2014 at 5:28 AM, Gabriel Reid <gabriel.reid@gmail.com>
> > wrote:
> >
> > > Hey Christian,
> > >
> > > This looks awesome! There have been a bunch of times when I've been
> > > digging around in the planner and wanting to have something like this,
> > > so yes, I definitely think this is useful to have.
> > >
> > > - Gabriel
> > >
> > >
> > > On Tue, Jul 1, 2014 at 2:16 PM, Christian Tzolov
> > > <christian.tzolov@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > > While exploring the Crunch MR execution flow I decided to augment the
> > > > > excellent pipeline DOT diagram with a few additional visualizations of
> > > > > some interesting (for me) internal/intermediate pipeline preparation
> > > > > states, such as the output-pcollection-targets structure (used for the
> > > > > pipeline planning), the Graphs before and after the split-up of
> > > > > dependent GBK nodes, and the RTNode hierarchy as persisted in the
> > > > > Configuration before the execution of the pipeline.
> > > > > For each diagram I've plotted some relevant internals like the PType
> > > > > structures. The implementation hack includes 3 additional
> > > > > DotfileWriters hooked inside MSCRPlanner#plan() to intercept the flow.
> > > >
> > > > > An example of the diagrams generated from the
> > > > > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked
> > > > > below.
> > > >
> > > > > Do we need this kind of internals visualization? Something like a
> > > > > visualization of the logical, mapping and physical (e.g. RTNodes) plans
> > > > > of the pipeline preparation? What do you think?
> > > >
> > > > Cheers,
> > > > Christian
> > > >
> > > >
> > > > Diagrams generated from the
> > > > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline.
> > > >
> > > > > - Dotfile containing all graphs:
> > > > > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint.dot
> > > >
> > > >
> > > > 1.
> > > >
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_main.png
> > > > - is the existing diagram. It provides very well balanced view of the
> > > > pipeline, showing how the functional blocks are mapped into execution
> > > > Map/Reduce components and the dependencies between them.
> > > >
> > > > 2.
> > > >
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_pcollection_outputTargets.png
> > > > - Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>>
> outputs)
> > > in
> > > > the MSCRPlanner on plan() operation is execution:
> > > > - Each data flow is depicted with different color to indicate the
> > > > overlapping execution paths.
> > > > - The PCollection name, class and PTypes are shown.
> > > >
> > > > 3.
> > > >
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_BaseGraph.png
> > > > - Visualizes the 'Base Graph' created in the MSCRPlanner#plan()
> method.
> > > It
> > > > draws the Vertices with their names, pcollection and ptype. The arc
> > label
> > > > lists the Graph's edge path lists.
> > > >
> > > > 4.
> > > >
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_FinalGraphWithComponents.png
> > > > - Graph created in the MSCRPlanner#plan() after the splits up of
> > > dependent
> > > > GBK nodes and break the graph up into connected components - bounded
> by
> > > > read dashed line.
> > > >
> > > > 5.
> > > >
> > >
> >
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_RTNodesAndFormatBundles.png
> > > > - Visualizes the RTNodes ussed inside the CrunchMapper and
> > CrunchReducer
> > > as
> > > > well as the Inputs and Outputs.
> > > > - RTNodes are  deserialized from the Job's
> > > > CRUNCH_WORKING_DIRECTORY/(MAP|REDUCE|COMBINE). Every RTNode is mapped
> > to
> > > > the containing Map or Reduce tasks and parent Crunch Job. The
> > > relationship
> > > > between RTNodes (e.g. parent/children)  is depicted with arrows.
> > > > - Named Outputs are deserialized from the CRUNCH_OUTPUTS into
> > Map<String,
> > > > OutputConfig> and depicted in the magenta subgraph
> > > > - Inputs are deserialized from the CRUNCH_INPUTS into
> Map<FormatBundle,
> > > > Map<Integer, List<Path>>> and depicted in green subgraph
> > > > - The inputs are mapped to the corresponding RTNode using the
> nodeIndex
> > > > reference.
> > > > - Outputs are mapped to the corresponding RTNode by the Output Name
> > > > references
> > > > - There is not good way to print the anonymous DoFn instances.
> > > > - Note: the dependency between the crunch jobs is not drawn as it my
> > > > require access to the competition hook attributes.
> > > > - Note: in order to draw the RTNodes i had to expose its attributes
> via
> > > > public getters.
> > >
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
>
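
For readers without access to the linked images, the drawing conventions described in the quoted message (a dashed red boundary around each connected component, and a separate color per data flow) can be sketched in plain Java as below. This is only an illustration under stated assumptions: PlanDotSketch, Edge, toDot and the node names are hypothetical and are not part of Crunch, and the sketch does not reproduce the DotfileWriters mentioned in the thread. It only emits standard Graphviz DOT text for a toy graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; this class is not part of Crunch. */
final class PlanDotSketch {

    /** A directed edge between two named nodes, tagged with a data-flow index. */
    static final class Edge {
        final String from;
        final String to;
        final int flow;

        Edge(String from, String to, int flow) {
            this.from = from;
            this.to = to;
            this.flow = flow;
        }
    }

    // Assumed palette: one color per data flow, reused cyclically.
    private static final String[] COLORS = {"blue", "darkgreen", "orange", "purple"};

    /**
     * Renders a DOT digraph. Each entry of {@code components} becomes a cluster
     * subgraph drawn with a dashed red border; each edge is colored by its flow.
     */
    static String toDot(String graphName,
                        Map<String, List<String>> components,
                        List<Edge> edges) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph \"").append(escape(graphName)).append("\" {\n");

        int index = 0;
        for (Map.Entry<String, List<String>> component : components.entrySet()) {
            // Graphviz only draws a border for subgraphs whose name starts with "cluster".
            sb.append("  subgraph cluster_").append(index++).append(" {\n");
            sb.append("    label=\"").append(escape(component.getKey())).append("\";\n");
            sb.append("    style=dashed; color=red;\n");
            for (String node : component.getValue()) {
                sb.append("    \"").append(escape(node)).append("\";\n");
            }
            sb.append("  }\n");
        }

        for (Edge e : edges) {
            String color = COLORS[Math.floorMod(e.flow, COLORS.length)];
            sb.append("  \"").append(escape(e.from)).append("\" -> \"")
              .append(escape(e.to)).append("\" [color=").append(color).append("];\n");
        }

        sb.append("}\n");
        return sb.toString();
    }

    /** Escapes backslashes and double quotes for use inside a quoted DOT identifier. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Toy graph with made-up node names; two components, two data flows.
        Map<String, List<String>> components = new LinkedHashMap<>();
        components.put("component 0", Arrays.asList("input", "map-side DoFn"));
        components.put("component 1", Arrays.asList("GBK", "reduce-side DoFn"));

        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("input", "map-side DoFn", 0));
        edges.add(new Edge("map-side DoFn", "GBK", 0));
        edges.add(new Edge("GBK", "reduce-side DoFn", 1));

        System.out.print(toDot("toy plan", components, edges));
    }
}
```

The printed text can be saved to a file and rendered with the Graphviz dot tool, for example `dot -Tpng plan.dot -o plan.png`.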
