incubator-crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <>
Subject Re: MemPipeline and context
Date Wed, 30 Jan 2013 14:07:29 GMT
Hey Tim,

Replies inlined.

On Wed, Jan 30, 2013 at 1:33 AM, Tim van Heugten <> wrote:

> Hi,
> Since april I'm using Crunch for a project. We're not doing only linear
> executions of the pipeline, so we're sometimes having issues with how
> Crunch is optimizing our execution graph. We need to add materializations
> here and there as hints to what parts of the graph can be shared for
> outputs and so on.
> Recently we decided to see if 0.4.0-incubating would provide us any
> improvements (I'm afraid not yet). Trying to adapt our code to the new API,
> however, exposed some difficulties and issues. A few bugs have been
> reported regarding those issue (CRUNCH-152 to CRUNCH-155), thank you for
> picking them up.
> The difficulties arise from the newly introduced tight bound with
> TaskInputOutputContext. Now in our jUnit tests we need to inject this
> before the tests can run (many of our DoFns adjust counters of perform
> progress() calls). So far so good, I can use
> CrunchTestSupport.getTestContext(config) with a mocked config and call
> setContext() on the DoFn. But there is some unclarity:
> *Should I call initialize() after setContext()?
> *I can see initialize() is called in setContext(), but this doesn't seem
> documented or guarenteed. Should setContext() be made final so it can be
> documented that initialize does not need to be called after?

Yes, I think so. I'm not sure of the implications of doing that, but I'll
create a branch and see what fails and what needs fixing.

> In our more elaborate tests we use MemPipeline to see the combined effect
> of our DoFns. But there:
> *MemCollection shows ambiguous behaviour wrt initialize/setContext.
> *A parallelDo with a PCollection output makes a call to *just*initialize(), and a parallelDo
with a PTable output makes a call to
> *both* initialize() and setContext(). Currently this fails some of our
> tests because we use counters and progress().

Yeah, that's no good. Let's open a JIRA for that one.

> Finally, I've had to create our own implementation of MemCollection
> altogether*, because the stubbed TaskInputOutputContext is too limited for
> our tests.
> *Stubbed TaskInputOutputContext in MemCollection is unable to handle
> Counters*.
> I'm aware that so much is stated in the javadoc, but I can't choose *not*to use counters
when testing the business code. Because Counters were
> handled (and even counted) in 0.3.0 I'm feeling confident enough about this
> to raise the issue.

Agreed-- I didn't realize MemCollection gave up the ability to use Counters
in that change-- I'm a bit surprised by that. Let's create a JIRA to put it
back in.

> I'm very happy with the api of crunch and would love for this project to
> become more reliable and widely adopted. If there is anything I can do (or
> instruction on where to begin understanding the planning component) please
> let me know.
> Cheers,
> Tim van Heugten
> * Because I use MemPipeline in test contexts only I rely on the mocked
> instance from CrunchTestSupport.getTestContext(conf);, further, replacing
> MemCollection implies replacing MemTable and MemGroupedTable as well.

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>

View raw message