beam-commits mailing list archives

From "Pawel Szczur (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-315) Flink Runner compares keys unencoded which may produce incorrect results
Date Thu, 02 Jun 2016 14:20:59 GMT

    [ https://issues.apache.org/jira/browse/BEAM-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312335#comment-15312335 ]

Pawel Szczur edited comment on BEAM-315 at 6/2/16 2:20 PM:
-----------------------------------------------------------

One more finding: reproducing the bug depends on the total size of the data, not on the number
of keys or elements per key. I've just modified the example by adding a 5 KB blob to each item,
and it now fails with a considerably smaller number of elements. It may be related to the
mechanism used for keeping data out of memory.

After taking a look at the data, something looks odd:
within a given shard there are sub-shards of data. Take a look at execution_split_sorted.log; the
columns are: worker, key, number of values.

If you search, you will notice that some pairs of (worker, number of values) appear multiple
times.
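
The search described above could be automated with a small script. This is only a sketch: it assumes each row of the log is three whitespace-separated fields (worker, key, number of values), which may not match the real layout of execution_split_sorted.log, and the sample rows below are made up for illustration, not taken from the attachment. The function names are hypothetical.

```python
from collections import Counter


def parse_rows(lines):
    """Yield (worker, key, count) tuples.

    Assumption: three whitespace-separated fields per line.
    Lines that do not fit that shape are skipped.
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        worker, key, count = parts
        try:
            yield worker, key, int(count)
        except ValueError:
            continue


def repeated_pairs(rows):
    """Return the (worker, number of values) pairs seen more than once."""
    counts = Counter((worker, n) for worker, _key, n in rows)
    return {pair: c for pair, c in counts.items() if c > 1}


def keys_seen_more_than_once(rows):
    """Return the keys that occur in more than one row (split groups)."""
    seen = Counter(key for _worker, key, _n in rows)
    return sorted(k for k, c in seen.items() if c > 1)


# Synthetic rows, for illustration only (not from the real log).
SAMPLE = [
    "w1 k1 3",
    "w1 k2 3",
    "w2 k1 5",
    "w2 k3 1",
    "not a valid row here",
]

rows = list(parse_rows(SAMPLE))
print(repeated_pairs(rows))
print(keys_seen_more_than_once(rows))
```

To run it against the attachment, replace SAMPLE with the lines of the file (for example `open("execution_split_sorted.log")`), adjusting the parsing if the columns are separated differently.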


> Flink Runner compares keys unencoded which may produce incorrect results
> ------------------------------------------------------------------------
>
>                 Key: BEAM-315
>                 URL: https://issues.apache.org/jira/browse/BEAM-315
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>    Affects Versions: 0.1.0-incubating
>            Reporter: Pawel Szczur
>            Assignee: Aljoscha Krettek
>         Attachments: CoGroupPipelineStringKey.java, execution.log, execution_split.log, execution_split_sorted.log
>
>
> The same keys are processed multiple times.
> A repo to reproduce the bug:
> https://github.com/orian/cogroup-wrong-grouping
> Discussion:
> http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201605.mbox/%3CCAB2uKkG2xHsWpLFUkYnt8eEzdxU%3DB_nu6crTwVi-ZuUpugxkPQ%40mail.gmail.com%3E
> Note: I haven't tested other runners (I didn't manage to configure Spark).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)