incubator-crunch-user mailing list archives

From Tim van Heugten <sti...@gmail.com>
Subject Re: MemPipeline and context
Date Tue, 05 Feb 2013 14:18:00 GMT
Hi,

It turns out the data in the two branches that are unioned in union2 is not
mutually exclusive (counter to what I was expecting), so we should probably
expect some data duplication there.
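
For reference, here is a minimal sketch (using MemPipeline and made-up
values, not our actual data) of why overlapping branches yield duplicates:
Crunch's union() is a bag union and does not deduplicate, so any record
present in both branches shows up twice downstream.

import org.apache.crunch.PCollection;
import org.apache.crunch.impl.mem.MemPipeline;

public class UnionOverlapSketch {
  public static void main(String[] args) {
    // Two branches whose contents overlap on "b".
    PCollection<String> left = MemPipeline.collectionOf("a", "b");
    PCollection<String> right = MemPipeline.collectionOf("b", "c");

    // union() concatenates its inputs and does not deduplicate,
    // so "b" is printed twice here. Overlapping branches therefore
    // duplicate records by design, independent of the MR planner.
    for (String value : left.union(right).materialize()) {
      System.out.println(value);
    }
  }
}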

However, this still does not explain why we sometimes find data duplication
and sometimes we don't.

Will keep you posted,

Tim


On Tue, Feb 5, 2013 at 11:32 AM, Tim van Heugten <stimme@gmail.com> wrote:

> Hi Gabriel,
>
> I've been unsuccessful so far in reproducing the issue in a controlled
> environment. As said, it's fragile; maybe the types involved play a role, so
> when I tried to simplify those I broke the failure condition.
> I've decided it's time to try providing more information without giving an
> explicit example.
>
> The pipeline we build is illustrated here: http://yuml.me/8ef99512.
> Depending on where we materialize, the data occurs twice in UP.
> The EITPI job filters the exact opposite of the filter branch. In PWR only
> data from EITPI is passed through, while the PITP data is used to modify it.
> Below are the job names as executed when data duplication occurs;
> materializations occur before BTO(*) and after UP.
> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
>
> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
> "Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
> "Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"
>
> Here are the jobs performed when a materialization is added between BTO and
> GBK:
>
> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
> "Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
> "Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
> "Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"
>
> Without changing anything else, the added materialization fixes
> the issue of data duplication.
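>
> To make the workaround concrete, it boils down to something like the sketch
> below. The paths and the "bto-like"/"up-like" functions are made-up
> stand-ins, not our real DoFns; the only point is the extra materialize()
> call on the intermediate collection before the groupByKey, which asks the
> planner to write that collection out and is the only difference between the
> two plans above.
>
> import org.apache.crunch.DoFn;
> import org.apache.crunch.Emitter;
> import org.apache.crunch.MapFn;
> import org.apache.crunch.PCollection;
> import org.apache.crunch.PTable;
> import org.apache.crunch.Pair;
> import org.apache.crunch.Pipeline;
> import org.apache.crunch.impl.mr.MRPipeline;
> import org.apache.crunch.types.avro.Avros;
>
> public class MaterializeWorkaroundSketch {
>   public static void main(String[] args) {
>     Pipeline pipeline = new MRPipeline(MaterializeWorkaroundSketch.class);
>     PCollection<String> input = pipeline.readTextFile("/path/to/input");
>
>     // Stand-in for the stage whose output feeds the groupByKey (our "BTO").
>     PTable<String, String> btoLike = input.parallelDo("bto-like",
>         new MapFn<String, Pair<String, String>>() {
>           @Override
>           public Pair<String, String> map(String line) {
>             // Arbitrary key, just to give the sketch a table to group.
>             return Pair.of(String.valueOf(line.length()), line);
>           }
>         }, Avros.tableOf(Avros.strings(), Avros.strings()));
>
>     // The workaround: asking for the intermediate result to be materialized
>     // registers it as an output for the planner, so the following GBK job
>     // reads it from the written file (as in the second job list above).
>     btoLike.materialize();
>
>     // Stand-in for the grouped processing (our "UP").
>     PCollection<String> result = btoLike.groupByKey().parallelDo("up-like",
>         new DoFn<Pair<String, Iterable<String>>, String>() {
>           @Override
>           public void process(Pair<String, Iterable<String>> group,
>                               Emitter<String> emitter) {
>             int count = 0;
>             for (String ignored : group.second()) {
>               count++;
>             }
>             emitter.emit(group.first() + "\t" + count);
>           }
>         }, Avros.strings());
>
>     pipeline.writeTextFile(result, "/path/to/output");
>     pipeline.done();
>   }
> }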
>
> If you have any clues on how I can extract a clean working example, I'm happy
> to hear them.
>
>
> *) This materialization probably explains the second job; however, where
> the filtered data is joined is lost on me. This is not the cause, though:
> with just one materialize at the end, after UP, the data count still
> doubled. The jobs then look like this:
> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
>
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
>
> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
> "Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
> "Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"
>
> BR,
>
> Tim van Heugten
>
>
> On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
>> Hi Tim,
>>
>> On 31 Jan 2013, at 10:45, Tim van Heugten <stimme@gmail.com> wrote:
>>
>> > Hi Gabriel,
>> >
>> > For the most part it is similar to what was sent around recently on
>> > this mailing list, see:
>> > From  Dave Beech <d...@paraliatech.com>
>> > Subject       Question about mapreduce job planner
>> > Date  Tue, 15 Jan 2013 11:41:42 GMT
>> >
>> > So, the common path before the multiple outputs branch off is executed twice.
>> > Sometimes the issues seem related to unions though, i.e. multiple inputs.
>> > We seem to have been troubled by a grouped-table parallelDo on a
>> > table-union-gbk that got its data twice (all groups doubled in size).
>> > Inserting a materialize between the union and the groupByKey solved the issue.
>> >
>> > These issues seem very fragile (so they're fixed easily by changing
>> > something that's irrelevant to the output), so usually we just add or
>> > remove a materialization to make it run again.
>> > I'll see if I can cleanly reproduce the data duplication issue later
>> > this week.
>>
>> Ok, that would be great if you could replicate it in a small test, thanks!
>>
>> - Gabriel
>
>
>
