From: Adrian Bartnik
Subject: Flink program with for loop yields wrong results when run in parallel
To: user@flink.apache.org
Date: Mon, 4 Jul 2016 11:56:06 +0200

Hi,

I have a Flink program which outputs wrong results once I set the parallelism to a value larger than 1.
If I run the program with parallelism 1, everything works fine.
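
For context, this is how I set the parallelism (a minimal sketch of my setup; the real job is more involved):

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4) // any value > 1 yields wrong results; 1 works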

The algorithm works on a single input dataset, which is split iteratively until the desired number of output splits is reached.
How to split the cluster in each iteration is itself determined iteratively.

Pseudocode:

val input: DataSet[...] = ...

for (currentSplitNumber <- 1 to numberOfSplits) { // Split the dataset until the desired number of splits is reached
    // Iteratively compute best split
    val determinedSplit = ... // computed by an iteration involving input

    // Split dataset to 2 smaller ones
    val tmpDataSet1 = determinedSplit.filter(_ == 1) ...
    val tmpDataSet2 = determinedSplit.filter(_ == 0) ...

    tmpDataSet1.count() // These counts are necessary to record the size of each split
    tmpDataSet2.count()

    // Store tmpDataSet1 and 2 as they are needed in one of the next loop executions (as dataset to be split)
    ...

}
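
To make the pseudocode concrete, here is a simplified sketch of the same structure. The element type (id, label), the queue, numberOfSplits, and the identity-style step function are placeholders for my real (elided) logic:

    import org.apache.flink.api.scala._
    import scala.collection.mutable

    val env = ExecutionEnvironment.getExecutionEnvironment
    val input: DataSet[(Long, Int)] = env.fromElements((1L, 0), (2L, 1)) // (id, label)
    val numberOfSplits = 4 // placeholder

    val pending = mutable.Queue(input) // splits still to be processed
    for (currentSplitNumber <- 1 to numberOfSplits) {
      val current = pending.dequeue()

      // inner bulk iteration: compute the best split of `current`
      val determinedSplit = current.iterate(10) { ds =>
        ds.map(x => x) // placeholder refinement step
      }

      val tmpDataSet1 = determinedSplit.filter(_._2 == 1)
      val tmpDataSet2 = determinedSplit.filter(_._2 == 0)

      val size1 = tmpDataSet1.count() // count() triggers execution up to this point
      val size2 = tmpDataSet2.count()

      pending.enqueue(tmpDataSet1, tmpDataSet2) // needed in later loop executions
    }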

It all comes down to two nested loops, one of which can be replaced by an iteration.
As nested iterations are not supported yet, I do not know how to avoid the outer loop.
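
For clarity, a direct translation of the outer loop into a Flink iteration would put one iteration inside another, which is exactly the unsupported nesting (sketch; the map is a placeholder step):

    // outer iteration whose step function itself contains a bulk iteration
    val result = input.iterate(numberOfSplits) { outer =>
      outer.iterate(10) { inner => inner.map(x => x) } // nested: not supported
    }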

Is this a known problem, and if so, what would be a solution?

Best,
Adrian