Mailing-List: contact user-help@beam.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@beam.apache.org
Subject: Re: Terasort-like pipeline
To: user@beam.apache.org
References: <46519901-a72c-5354-0c4b-7901197c064e@seznam.cz>
 <CAFPXSNNXiTpTqxxuacrAk1b04Y5=efgjs-ORaOERBXW=R9b6cA@mail.gmail.com>
From: =?UTF-8?Q?Jan_Lukavsk=c3=bd?= <je.ik@seznam.cz>
Message-ID: <1d0d4abf-8697-d180-de64-a45504f33bd0@seznam.cz>
Date: Fri, 21 Jul 2017 16:31:42 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <CAFPXSNNXiTpTqxxuacrAk1b04Y5=efgjs-ORaOERBXW=R9b6cA@mail.gmail.com>
Content-Type: multipart/alternative;
 boundary="------------4036E0A7BDA2EE5E9A7D5679"
Content-Language: en-US
archived-at: Fri, 21 Jul 2017 14:31:57 -0000

This is a multi-part message in MIME format.
--------------4036E0A7BDA2EE5E9A7D5679
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

Hi,

thanks for the answer. I understand that Beam does not want to
incorporate a way to control parallelism into the model (that is left to
the runner to decide, which I think is good). But there are some use
cases where it would be beneficial to force *sequential* processing,
that is, to make sure that a certain PCollection (or, to be exact, each
window of a PCollection) is processed entirely by a single
(fault-tolerant) instance. The terasort pipeline would then be
realizable, and I don't think this would even affect the runners much.
Many of them (actually all the ones I know :)) already have an option to
process a "partition" with a single "mapper" or "processor".

Would it be possible to add a sequential form of ParDo to the model?
Or is that strictly against Beam's philosophy?
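
To make the target concrete, here is a minimal sketch of terasort-style
range partitioning in plain Python (not Beam code). It samples the input
to pick split points, assigns each element a partition ID by range, and
sorts each partition independently, so that partition IDs compare the
same way their elements do. The function names and the sampling strategy
are illustrative assumptions, not part of Beam, Euphoria, or the Hadoop
terasort example.

```python
import bisect
import random
from collections import defaultdict


def choose_split_points(sample, num_partitions):
    """Pick up to num_partitions - 1 boundaries from a sample of the input."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_id(element, split_points):
    # Elements below the first split point get ID 0; elements equal to a
    # split point go to the partition on its right.
    return bisect.bisect_right(split_points, element)


def range_partition_sort(elements, num_partitions, sample_size=100, seed=0):
    """Return {partition ID: sorted elements} for the non-empty partitions."""
    elements = list(elements)
    rng = random.Random(seed)
    sample = rng.sample(elements, min(sample_size, len(elements)))
    split_points = choose_split_points(sample, num_partitions)

    # The "group-by-key" step, with the partition ID as the key.
    groups = defaultdict(list)
    for element in elements:
        groups[partition_id(element, split_points)].append(element)

    # Each partition is sorted on its own, as a single sequential task.
    return {pid: sorted(groups[pid]) for pid in sorted(groups)}


if __name__ == "__main__":
    data = [random.randrange(1000) for _ in range(20)]
    for pid, part in range_partition_sort(data, 4).items():
        print(pid, part)
```

Writing the partitions out in ascending partition ID order gives the
globally sorted result. The open question is how to guarantee that each
partition is emitted by exactly one sequential worker.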

  Jan


On 07/19/2017 10:48 PM, Vikas RK wrote:
> The Beam model doesn't support global sorting. [1] discusses this in
> detail; you might find it useful.
>
> [1] 
> https://lists.apache.org/thread.html/bc0e65a3bb653b8fd0db96bcd4c9da5af71a71af5a5639a472167808@1464278191@%3Cdev.beam.apache.org%3E
>
> On 19 July 2017 at 02:45, Jan Lukavský <je.ik@seznam.cz> wrote:
>
>     Hi all,
>
>     I'm trying to get a better understanding of Beam's internals for
>     the sake of integration with the Euphoria API as a DSL ([1]).
>     While trying to wrap Euphoria's abstractions of outputs, I came
>     across a small issue that I'm currently a little stuck on. The
>     issue itself is not important to this question, but it basically
>     boils down to the following: how could I write a Pipeline that
>     works like the terasort benchmark ([2])? That is, I have a
>     randomly distributed dataset (let's assume the batch case for
>     simplicity), and I want to sort it so that the output consists of
>     N totally sorted partitions. This implies that I can somehow
>     compare the partitions (or partition IDs) on output, so that the
>     following holds: for any two partitions X and Y, if partition X
>     is less than partition Y, then all elements in partition X are
>     less than or equal to all elements in partition Y.
>
>     So far, I have not been able to find a clean solution in Beam. I
>     can do a group-by-key operation (where the *key* would be the
>     partition ID) and then sort the data within each key. But I have
>     trouble outputting the sorted data from a ParDo (because it can,
>     in theory, run in parallel, and therefore I can either lose the
>     sorting or run into concurrency issues).
>
>     Would anyone have an idea about how to do this?
>
>     Thanks for any comments,
>
>      Jan
>
>     [1] https://github.com/seznam/euphoria
>
>     [2]
>     https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html
>
>


--------------4036E0A7BDA2EE5E9A7D5679
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Hi,</p>
    <p>thanks for the answer. I understand that Beam does not want to
      incorporate a way to control parallelism into the model (that is
      left to the runner to decide, which I think is good). But there
      are some use cases where it would be beneficial to force
      *sequential* processing, that is, to make sure that a certain
      PCollection (or, to be exact, each window of a PCollection) is
      processed entirely by a single (fault-tolerant) instance. The
      terasort pipeline would then be realizable, and I don't think this
      would even affect the runners much. Many of them (actually all the
      ones I know :)) already have an option to process a "partition"
      with a single "mapper" or "processor".</p>
    <p>Would it be possible to add a sequential form of ParDo to the
      model? Or is that strictly against Beam's philosophy?</p>
    <p> Jan<br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 07/19/2017 10:48 PM, Vikas RK wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAFPXSNNXiTpTqxxuacrAk1b04Y5=efgjs-ORaOERBXW=R9b6cA@mail.gmail.com">
      <div dir="ltr">The Beam model doesn't support global sorting. [1]
        discusses this in detail; you might find it useful.
        <div><br>
        </div>
        <div>[1] <a
href="https://lists.apache.org/thread.html/bc0e65a3bb653b8fd0db96bcd4c9da5af71a71af5a5639a472167808@1464278191@%3Cdev.beam.apache.org%3E"
            moz-do-not-send="true">https://lists.apache.org/thread.html/bc0e65a3bb653b8fd0db96bcd4c9da5af71a71af5a5639a472167808@1464278191@%3Cdev.beam.apache.org%3E</a></div>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On 19 July 2017 at 02:45, Jan Lukavský
          <span dir="ltr">&lt;<a href="mailto:je.ik@seznam.cz"
              target="_blank" moz-do-not-send="true">je.ik@seznam.cz</a>&gt;</span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>
            <br>
            I'm trying to get a better understanding of Beam's internals
            for the sake of integration with the Euphoria API as a DSL
            ([1]). While trying to wrap Euphoria's abstractions of
            outputs, I came across a small issue that I'm currently a
            little stuck on. The issue itself is not important to this
            question, but it basically boils down to the following: how
            could I write a Pipeline that works like the terasort
            benchmark ([2])? That is, I have a randomly distributed
            dataset (let's assume the batch case for simplicity), and I
            want to sort it so that the output consists of N totally
            sorted partitions. This implies that I can somehow compare
            the partitions (or partition IDs) on output, so that the
            following holds: for any two partitions X and Y, if
            partition X is less than partition Y, then all elements in
            partition X are less than or equal to all elements in
            partition Y.<br>
            <br>
            So far, I have not been able to find a clean solution in
            Beam. I can do a group-by-key operation (where the *key*
            would be the partition ID) and then sort the data within
            each key. But I have trouble outputting the sorted data from
            a ParDo (because it can, in theory, run in parallel, and
            therefore I can either lose the sorting or run into
            concurrency issues).<br>
            <br>
            Would anyone have an idea about how to do this?<br>
            <br>
            Thanks for any comments,<br>
            <br>
             Jan<br>
            <br>
            [1] <a href="https://github.com/seznam/euphoria"
              rel="noreferrer" target="_blank" moz-do-not-send="true">https://github.com/seznam/euph<wbr>oria</a><br>
            <br>
            [2] <a
href="https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html"
              rel="noreferrer" target="_blank" moz-do-not-send="true">https://hadoop.apache.org/docs<wbr>/r2.7.1/api/org/apache/hadoop/<wbr>examples/terasort/package-<wbr>summary.html</a><br>
            <br>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </body>
</html>

--------------4036E0A7BDA2EE5E9A7D5679--