crunch-user mailing list archives

From: Gabriel Reid
Subject: Re: MemPipeline and context
Date: Thu, 31 Jan 2013 20:27:39 GMT

Hi Tim,

On 31 Jan 2013, at 10:45, Tim van Heugten wrote:

> Hi Gabriel,
>
> For the most part it is similar to what was sent around recently on this mailing list:
>
>   From: Dave Beech
>   Subject: Question about mapreduce job planner
>   Date: Tue, 15 Jan 2013 11:41:42 GMT
>
> So, the common path before the branch into multiple outputs is executed twice. Sometimes
> the issues seem related to unions though, i.e. multiple inputs. We seem to have been
> troubled by a parallelDo on a grouped table, downstream of a table union followed by a
> groupByKey, that got its data twice (every group doubled in size). Inserting a materialize
> between the union and the groupByKey solved the issue.
>
> These issues seem very fragile (they are easily fixed by changing something that is
> irrelevant to the output), so usually we just add or remove a materialization to make it run.
>
> I'll see if I can cleanly reproduce the data duplication issue later this week.

Ok, that would be great if you could replicate it in a small test, thanks!
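
As a rough illustration of what such a test could assert, here is a minimal sketch in plain Java collections. It does not use the Crunch API at all. It only models the expected semantics of a union followed by a group-by-key, so that the output of a real pipeline could be compared against it. The class and method names (UnionGroupCheck, unionThenGroup, totalValues) are hypothetical, as is the sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UnionGroupCheck {

    // Reference semantics (an assumption, not Crunch code): concatenate both
    // inputs, then collect the values of each key into one list.
    static Map<String, List<Integer>> unionThenGroup(
            List<Map.Entry<String, Integer>> left,
            List<Map.Entry<String, Integer>> right) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> input : List.of(left, right)) {
            for (Map.Entry<String, Integer> e : input) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add(e.getValue());
            }
        }
        return grouped;
    }

    // Total number of values across all groups.
    static int totalValues(Map<String, List<Integer>> grouped) {
        int n = 0;
        for (List<Integer> values : grouped.values()) {
            n += values.size();
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs.
        List<Map.Entry<String, Integer>> left =
                List.of(Map.entry("a", 1), Map.entry("b", 2));
        List<Map.Entry<String, Integer>> right =
                List.of(Map.entry("a", 3));

        Map<String, List<Integer>> grouped = unionThenGroup(left, right);

        // If the inputs were read twice, this total would be double the expected one.
        int expected = left.size() + right.size();
        if (totalValues(grouped) != expected) {
            throw new AssertionError("grouped data has " + totalValues(grouped)
                    + " values, expected " + expected);
        }
        System.out.println(grouped); // prints {a=[1, 3], b=[2]}
    }
}
```

A reproduction would then run the same small inputs through the real pipeline, with and without the materialize between the union and the groupByKey, and compare the per-key value counts against this reference.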

- Gabriel