incubator-crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <>
Subject Re: MemPipeline and context
Date Wed, 30 Jan 2013 15:44:56 GMT
I created and to track these.

On Wed, Jan 30, 2013 at 6:07 AM, Josh Wills <> wrote:

> Hey Tim,
> Replies inlined.
> On Wed, Jan 30, 2013 at 1:33 AM, Tim van Heugten <> wrote:
>> Hi,
>> Since april I'm using Crunch for a project. We're not doing only linear
>> executions of the pipeline, so we're sometimes having issues with how
>> Crunch is optimizing our execution graph. We need to add materializations
>> here and there as hints to what parts of the graph can be shared for
>> outputs and so on.
>> Recently we decided to see if 0.4.0-incubating would provide us any
>> improvements (I'm afraid not yet). Trying to adapt our code to the new API,
>> however, exposed some difficulties and issues. A few bugs have been
>> reported regarding those issue (CRUNCH-152 to CRUNCH-155), thank you for
>> picking them up.
>> The difficulties arise from the newly introduced tight bound with
>> TaskInputOutputContext. Now in our jUnit tests we need to inject this
>> before the tests can run (many of our DoFns adjust counters of perform
>> progress() calls). So far so good, I can use
>> CrunchTestSupport.getTestContext(config) with a mocked config and call
>> setContext() on the DoFn. But there is some unclarity:
>> *Should I call initialize() after setContext()?
>> *I can see initialize() is called in setContext(), but this doesn't seem
>> documented or guarenteed. Should setContext() be made final so it can be
>> documented that initialize does not need to be called after?
> Yes, I think so. I'm not sure of the implications of doing that, but I'll
> create a branch and see what fails and what needs fixing.
>> In our more elaborate tests we use MemPipeline to see the combined effect
>> of our DoFns. But there:
>> *MemCollection shows ambiguous behaviour wrt initialize/setContext.
>> *A parallelDo with a PCollection output makes a call to *just*initialize(), and a
parallelDo with a PTable output makes a call to
>> *both* initialize() and setContext(). Currently this fails some of our
>> tests because we use counters and progress().
> Yeah, that's no good. Let's open a JIRA for that one.
>> Finally, I've had to create our own implementation of MemCollection
>> altogether*, because the stubbed TaskInputOutputContext is too limited for
>> our tests.
>> *Stubbed TaskInputOutputContext in MemCollection is unable to handle
>> Counters*.
>> I'm aware that so much is stated in the javadoc, but I can't choose *not*to use counters
when testing the business code. Because Counters were
>> handled (and even counted) in 0.3.0 I'm feeling confident enough about this
>> to raise the issue.
> Agreed-- I didn't realize MemCollection gave up the ability to use
> Counters in that change-- I'm a bit surprised by that. Let's create a JIRA
> to put it back in.
>> I'm very happy with the api of crunch and would love for this project to
>> become more reliable and widely adopted. If there is anything I can do (or
>> instruction on where to begin understanding the planning component) please
>> let me know.
>> Cheers,
>> Tim van Heugten
>> * Because I use MemPipeline in test contexts only I rely on the mocked
>> instance from CrunchTestSupport.getTestContext(conf);, further, replacing
>> MemCollection implies replacing MemTable and MemGroupedTable as well.
> --
> Director of Data Science
> Cloudera <>
> Twitter: @josh_wills <>

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>

View raw message