crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-420) Breakpoints Not Working
Date Wed, 18 Jun 2014 07:29:02 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034925#comment-14034925 ]

Gabriel Reid commented on CRUNCH-420:
-------------------------------------

I think your analysis is correct, in that there are two situations where we want this kind
of functionality. 

I don't think that there is a reduce-side version of the problem where we're not between two
GBKs. A single stream coming out of a GBK will only split into multiple streams at the latest
point possible, and things will only be run once anyhow thanks to the multiple outputs from
a reducer.

I think that the more "correct" method that should be called for this kind of functionality
(in documentation) is {{PCollection.cache()}} instead of {{PCollection.materialize()}}. The
cache method is just a call to materialize anyhow, but I think it's more consistent with the
intended meaning of the cache method in a Spark context (is that right?)

The patch goes in the same direction that I was thinking, but there still seem to be some
issues with it. If the breakpointed pipeline in Breakpoint2IT actually gets run, it crashes
with a StackOverflowError.

I put together a little mini-test to demonstrate what this patch is doing (actually,
it might be good to use a simpler situation like this in Breakpoint2IT as well to make
it easier to debug). My test case simply reads in a single PCollection of strings, maps it
to a table using an IdentityFn, runs the table through an IdentityFn, and then sends it to
two GBKs which are then ungrouped and written.

Running my mini test without a breakpoint gives a job plan that looks like this, as expected:

!withoutbreakpoint.png!

Running the mini test with a breakpoint gives this job plan:

!withbreakpoint.png!

I think we want to have three jobs when the breakpoint is enabled -- a single map-only job,
and then two jobs that do the grouping stemming from the output of the first job.
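
To make the intended plan shape concrete, here is a rough sketch in plain Java. It is a toy model, not Crunch planner code: the class name, the {{plan}} method and the {{Job}} type are all made up for illustration. It also assumes that, without a breakpoint, each GBK job re-runs the map-side functions itself (which is the repetition the issue describes), and that with a breakpoint a map-only job runs them once and each GBK job just reads its output.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the job plans discussed above; this is not Crunch planner code. */
public class BreakpointPlanSketch {

    /** One planned job: a name plus how many map-side DoFns it has to execute. */
    static final class Job {
        final String name;
        final int mapSideFns;

        Job(String name, int mapSideFns) {
            this.name = name;
            this.mapSideFns = mapSideFns;
        }
    }

    /**
     * Builds the assumed plan for a pipeline with some map-side DoFns feeding
     * gbkCount group-by-key operations.
     */
    static List<Job> plan(int mapSideFns, int gbkCount, boolean breakpoint) {
        List<Job> jobs = new ArrayList<>();
        if (breakpoint) {
            // Assumption: the breakpoint forces one map-only job that runs the DoFns once.
            jobs.add(new Job("map-only", mapSideFns));
        }
        for (int i = 1; i <= gbkCount; i++) {
            // Assumption: without a breakpoint every GBK job repeats the map-side DoFns.
            jobs.add(new Job("gbk-" + i, breakpoint ? 0 : mapSideFns));
        }
        return jobs;
    }

    /** Total number of map-side DoFn executions across all jobs in the plan. */
    static int totalMapSideRuns(List<Job> jobs) {
        int total = 0;
        for (Job job : jobs) {
            total += job.mapSideFns;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two map-side functions and two GBKs, as in the mini-test.
        for (boolean breakpoint : new boolean[] {false, true}) {
            List<Job> jobs = plan(2, 2, breakpoint);
            System.out.println("breakpoint=" + breakpoint
                + " jobs=" + jobs.size()
                + " mapSideFnRuns=" + totalMapSideRuns(jobs));
        }
    }
}
```

In this model the breakpoint adds one job to the plan, and in exchange the map-side functions run once overall instead of once per GBK.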


> Breakpoints Not Working
> -----------------------
>
>                 Key: CRUNCH-420
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-420
>             Project: Crunch
>          Issue Type: Bug
>         Environment: Crunch 0.8.2
>            Reporter: Allan Shoup
>            Assignee: Josh Wills
>         Attachments: Breakpoint2IT.java, CRUNCH-420.patch, testBreakpoint_plan.png, withbreakpoint.png, withoutbreakpoint.png
>
>
> Reading through CRUNCH-294, it looks like materialize is supposed to function as a breakpoint
> to the planner. I've seen several plans where it appeared to me a particular DoFn shouldn't
> have been repeated, but it was.
> I'll attach some supporting material.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
