incubator-crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim van Heugten <>
Subject MemPipeline and context
Date Wed, 30 Jan 2013 09:33:49 GMT

Since april I'm using Crunch for a project. We're not doing only linear
executions of the pipeline, so we're sometimes having issues with how
Crunch is optimizing our execution graph. We need to add materializations
here and there as hints to what parts of the graph can be shared for
outputs and so on.

Recently we decided to see if 0.4.0-incubating would provide us any
improvements (I'm afraid not yet). Trying to adapt our code to the new API,
however, exposed some difficulties and issues. A few bugs have been
reported regarding those issue (CRUNCH-152 to CRUNCH-155), thank you for
picking them up.

The difficulties arise from the newly introduced tight bound with
TaskInputOutputContext. Now in our jUnit tests we need to inject this
before the tests can run (many of our DoFns adjust counters of perform
progress() calls). So far so good, I can use
CrunchTestSupport.getTestContext(config) with a mocked config and call
setContext() on the DoFn. But there is some unclarity:
*Should I call initialize() after setContext()?
*I can see initialize() is called in setContext(), but this doesn't seem
documented or guarenteed. Should setContext() be made final so it can be
documented that initialize does not need to be called after?

In our more elaborate tests we use MemPipeline to see the combined effect
of our DoFns. But there:
*MemCollection shows ambiguous behaviour wrt initialize/setContext.
*A parallelDo with a PCollection output makes a call to
*just*initialize(), and a parallelDo with a PTable output makes a call
*both* initialize() and setContext(). Currently this fails some of our
tests because we use counters and progress().

Finally, I've had to create our own implementation of MemCollection
altogether*, because the stubbed TaskInputOutputContext is too limited for
our tests.
*Stubbed TaskInputOutputContext in MemCollection is unable to handle
I'm aware that so much is stated in the javadoc, but I can't choose
*not*to use counters when testing the business code. Because Counters
handled (and even counted) in 0.3.0 I'm feeling confident enough about this
to raise the issue.

I'm very happy with the api of crunch and would love for this project to
become more reliable and widely adopted. If there is anything I can do (or
instruction on where to begin understanding the planning component) please
let me know.


Tim van Heugten

* Because I use MemPipeline in test contexts only I rely on the mocked
instance from CrunchTestSupport.getTestContext(conf);, further, replacing
MemCollection implies replacing MemTable and MemGroupedTable as well.

View raw message