Subject: Re: join with no element appearing in multiple join-pairs
From: Fridtjof Sander
To: user@flink.apache.org
Date: Mon, 1 Feb 2016 11:37:52 +0100
(tried to reformat)

Hi,

I have a problem that seems to be unsolvable in Flink at the moment (1.0-SNAPSHOT, current master branch),
and I would kindly ask for some input, ideas for alternative approaches, or just a confirming "yup, that doesn't work".

### Here's the situation:

I have a dataset whose elements are totally sorted in ascending order by some key (Int). Each element has a "next" pointer to its successor, which is just another field holding the key of the following element:

x0 -> x1 -> x2 -> x3 -> ... -> xn

The keys do not necessarily increase by 1, so it may be that x0 has key 2, x1 has key 10, x2 has 11, x3 has 25, and so on. I need to process that set in the following way:

iterate:

find all pairs of elements where "next == key" BUT make sure no element appears in multiple pairs

example: do pair (x0, x1), (x2, x3), (x4, x5), ... but don't pair (x1, x2), (x3, x4), ...

then, if some condition is met, combine a pair

run the above procedure again with the pairing condition switched:

example: do pair (x1, x2), (x3, x4), (x5, x6), ... do not pair (x0, x1), (x2, x3), ..

I hope the problem is clear...
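
To make the setup concrete, here is a minimal sketch of the data I have in mind (the Element case class and the sentinel for "no successor" are just illustrative assumptions, not my actual types):

// Illustrative only: a minimal model of the sorted, linked elements.
case class Element(key: Int, next: Int)

// Example with keys 2 -> 10 -> 11 -> 25:
val elements = Seq(
  Element(key = 2,  next = 10),  // x0
  Element(key = 10, next = 11),  // x1
  Element(key = 11, next = 25),  // x2
  Element(key = 25, next = -1)   // x3, -1 meaning "no successor"
)

// The first run should produce the pairs (x0, x1) and (x2, x3),
// the second run only (x1, x2) -- no element in more than one pair per run.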


### Now my approach: pseudo-scala-code:


// zipWithIndex comes from org.apache.flink.api.scala.utils._ and yields
// a DataSet[(Long, Element)] of (index, element) pairs.
val indexed = input.zipWithIndex

// setFlag is assumed to return the element with its joinFlag set.
val flagged = indexed.map { case (i, el) => el.setFlag(i % 2 == 0) }

val left  = flagged.filter(el => el.flag)
val right = flagged.filter(el => !el.flag)

left.fullOuterJoin(right)
  .where(el => el.next)
  .equalTo(el => el.key)
  // ...


I attach a temporary key that increases by 1 to my elements using zipWithIndex. Then I map that tempKey to a boolean joinFlag: true if the index is even, false if it is odd. Then I filter for all elements whose flag is true and put them into a dataset that forms the left side of the next == key join. The right side consists of all elements with flag == false. In the second run, I switch the flag construction to el.setFlag(i % 2 != 0).
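
For clarity, the only thing that changes between the two runs is the parity used for the flag. A minimal sketch of how that could be parameterized (run is a hypothetical variable holding the pass number, 0 or 1; Element and setFlag as above):

// Run 0 flags even indices -> pairs (x0, x1), (x2, x3), ...
// Run 1 flags odd indices  -> pairs (x1, x2), (x3, x4), ...
val flagged = indexed.map { case (i, el) => el.setFlag(i % 2 == run % 2) }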

That actually works; there is only one problem:


### The problem:


With my approach, I must not lose the total ordering of the data, because the assignment of alternating join flags only works if that ordering is preserved. Initially, the ordering is established by range-partitioning and partition-sorting. However, that ordering is destroyed when the data is shuffled for the join, and I cannot restore it, because I have to run the whole thing inside an iteration, and range-partitioning is not supported within iterations.
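
For reference, this is roughly how the initial ordering is established (field names are again just illustrative; my real job differs in the details):

import org.apache.flink.api.common.operators.Order

// Range-partition by key, then sort within each partition, which gives a
// global ascending order across partitions.
val ordered = input
  .partitionByRange("key")
  .sortPartition("key", Order.ASCENDING)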


### Help?

It all sounds very complicated, but the only thing I really have to solve is that join without any element appearing in multiple pairs (as described in "the situation"). If anyone has any idea how to solve this, that person would really make my day...

Anyways, thanks for your time!

Best, Fridtjof


