crunch-user mailing list archives

From Tim van Heugten <>
Subject Re: MemPipeline and context
Date Thu, 31 Jan 2013 09:45:58 GMT
Hi Gabriel,

For the most part it is similar to what was sent around recently on this
mailing list, see:
From: Dave Beech <>, Subject: Question about mapreduce job planner,
Date: Tue, 15 Jan 2013 11:41:42 GMT

So, the common path that precedes a branch into multiple outputs is
executed twice. Sometimes the issues seem related to unions though, i.e.
multiple inputs. We were troubled by a parallelDo on a grouped table (a
table union followed by groupByKey) that received its data twice (every
group doubled in size). Inserting a materialize between the union and the
groupByKey solved the issue.

These issues seem very fragile (they are easily fixed by changing
something that is irrelevant to the output), so usually we just add or
remove a materialization to make it run again.
I'll see if I can cleanly reproduce the data duplication issue later this



On Wed, Jan 30, 2013 at 8:51 PM, Gabriel Reid <> wrote:

> Hi Tim,
> On Wed, Jan 30, 2013 at 10:33 AM, Tim van Heugten <> wrote:
>> Since April I've been using Crunch for a project. Our pipelines are not
>> purely linear, so we sometimes have issues with how Crunch optimizes
>> our execution graph. We need to add materializations here and there as
>> hints about which parts of the graph can be shared between outputs and
>> so on.
> About the extra calls to materialize to force changes to the execution
> plan: I remember seeing this previously. We've discussed adding something
> specifically for this functionality to the API, although it hasn't yet
> happened.
> Could you give an example of a situation where these extra materialize
> calls get added? That would be useful for validating the addition to the
> API.
> Thanks,
> Gabriel
