crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-420) Breakpoints Not Working
Date Wed, 18 Jun 2014 07:29:02 GMT


Gabriel Reid commented on CRUNCH-420:

I think your analysis is correct, in that there are two situations where we want this kind
of functionality. 

I don't think that there is a reduce-side version of the problem where we're not between two
GBKs. A single stream coming out of a GBK will only split into multiple streams at the latest
point possible, and things will only be run once anyhow thanks to the multiple outputs from
a reducer.

I think that the more "correct" method to call for this kind of functionality
(in documentation) is {{PCollection.cache()}} instead of {{PCollection.materialize()}}. The
cache method is just a call to materialize anyhow, but I think it's more consistent with the
intended meaning of the cache method in a Spark context (is that right?).

The patch goes in the same direction that I was thinking, but there still seem to be some
issues with it. If the breakpointed pipeline in Breakpoint2IT actually gets run, it crashes
with a StackOverflowError.

I put together a little mini-test to demonstrate what this patch is doing (actually,
it might be good to use a simpler situation like this in Breakpoint2IT as well to make
it easier to debug). My test case simply reads in a single PCollection of strings, maps it
to a table using an IdentityFn, runs the table through an IdentityFn, and then sends it to
two GBKs which are then ungrouped and written.

Running my mini test without a breakpoint gives a job plan that looks like this, as expected:


Running the mini test with a breakpoint gives this job plan:


I think we want to have three jobs when the breakpoint is enabled -- a single map-only job,
and then two jobs that do the grouping, each reading the output of the first job.
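
To make the intended effect concrete, here is a rough toy model in plain Java. It is not Crunch code and does not use the Crunch planner or API; the class and method names (BreakpointSketch, groupJob, and so on) are made up for illustration. It only counts how often a shared map-side function runs when two grouping jobs each recompute it, compared with a single map-only job followed by two grouping jobs that read its stored output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Toy model only: this is NOT the Crunch planner or API. Each "job" is a method
// call, and a counter records how often the shared map-side function runs.
class BreakpointSketch {

    // Stand-in for a DoFn shared by both branches; counts its invocations.
    static Function<String, String> countingIdentity(AtomicInteger calls) {
        return s -> {
            calls.incrementAndGet();
            return s;
        };
    }

    // Stand-in for one job that applies fn and then groups by the resulting key.
    static Map<String, List<String>> groupJob(List<String> input, Function<String, String> fn) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String s : input) {
            String v = fn.apply(s);
            grouped.computeIfAbsent(v, k -> new ArrayList<>()).add(v);
        }
        return grouped;
    }

    // No breakpoint: each of the two grouping jobs re-runs the shared function.
    static int callsWithoutBreakpoint(List<String> input) {
        AtomicInteger calls = new AtomicInteger();
        Function<String, String> shared = countingIdentity(calls);
        groupJob(input, shared);
        groupJob(input, shared);
        return calls.get();
    }

    // Breakpoint: a map-only job runs the shared function once and stores its
    // output; the two grouping jobs then read that stored output.
    static int callsWithBreakpoint(List<String> input) {
        AtomicInteger calls = new AtomicInteger();
        Function<String, String> shared = countingIdentity(calls);
        List<String> stored = new ArrayList<>();
        for (String s : input) {
            stored.add(shared.apply(s)); // job 1: map-only
        }
        groupJob(stored, Function.identity()); // job 2: first grouping
        groupJob(stored, Function.identity()); // job 3: second grouping
        return calls.get();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a");
        System.out.println("without breakpoint: " + callsWithoutBreakpoint(input));
        System.out.println("with breakpoint: " + callsWithBreakpoint(input));
    }
}
```

In this model the shared function runs once per record with the breakpoint and twice per record without it, which is the repeated-DoFn behaviour described in the original report.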

> Breakpoints Not Working
> -----------------------
>                 Key: CRUNCH-420
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>         Environment: Crunch 0.8.2
>            Reporter: Allan Shoup
>            Assignee: Josh Wills
>         Attachments:, CRUNCH-420.patch, testBreakpoint_plan.png, withbreakpoint.png,
> Reading through CRUNCH-294, it looks like materialize is supposed to function as a breakpoint
> to the planner. I've seen several plans where it appeared to me a particular DoFn shouldn't
> have been repeated, but it was.
> I'll attach some supporting material.

