crunch-user mailing list archives

From Tim van Heugten <>
Subject Re: MemPipeline and context
Date Tue, 05 Feb 2013 10:32:58 GMT
Hi Gabriel,

I have so far been unable to reproduce the issue in a controlled
environment. As I said, it's fragile; the types involved may play a role, so
when I tried to simplify them I broke the failure condition.
I decided it's time to provide more information without giving an
explicit example.

The pipeline we build is illustrated here:
Depending on where we materialize, the data occurs twice in UP.
The EITPI job filters the exact opposite of the filter branch. In PWR only
data from EITPI is passed through, while the PITP data is used to modify it.
Below you find the job names as executed when data duplication occurs;
materializations occur before BTO(*) and after UP.

Here are the jobs performed when materialization is added between BTO and

Without changing anything else, the added materialization fixes
the issue of data duplication.

If you have any clues as to how I can extract a clean working example, I'd be
happy to hear them.

*) This materialization probably explains the second job; however, where
the filtered data is joined is lost on me. This is not the cause, though:
with just one materialize at the end, after UP, the data count still
doubled. The jobs then look like this:


Tim van Heugten

On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid <> wrote:

> Hi Tim,
> On 31 Jan 2013, at 10:45, Tim van Heugten <> wrote:
> > Hi Gabriel,
> >
> > For the most part it is similar to what was sent around recently on this
> mailing list, see:
> > From  Dave Beech <>
> > Subject       Question about mapreduce job planner
> > Date  Tue, 15 Jan 2013 11:41:42 GMT
> >
> > So, the common path before multiple outputs branch is executed twice.
> Sometimes the issues seem related to unions though, i.e. multiple inputs.
> We seem to have been troubled by a grouped table parallelDo on a
> table-union-gbk that got its data twice (all grouped doubled in size).
> Inserting a materialize between the union and groupByKey solved the issue.
> >
> > These issues seem very fragile (so they're fixed easily by changing
> something that's irrelevant to the output), so usually we just add or
> remove a materialization to make it run again.
> > I'll see if I can cleanly reproduce the data duplication issue later
> this week.
> Ok, that would be great if you could replicate it in a small test, thanks!
> - Gabriel
