beam-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jan Lukavsk√Ĺ <je...@seznam.cz>
Subject Re: Terasort-like pipeline
Date Fri, 21 Jul 2017 14:31:42 GMT
Hi,

thanks for answer. I understand that Beam does not want to incorporate 
in the model a way to handle parallelism (because it is left to the 
runner to decide, which I find good). But there are some use-cases where 
it would be beneficial to force *sequential* processing. That is to make 
sure that certain PCollection (or, to state it exactly, each window of a 
PCollection) is processed entirely by a single (fault tolerant) 
instance. The terasort pipeline would then be realizable and I don't 
think that even affects the runners so much. Many of them (actuall all I 
know :)) nevertheless have this option to process a "partition" by a 
single "mapper" or "processor".

Would it be possible to add a sequential form of ParDo into the model? 
Or is it strictly against the philosophy?

  Jan


On 07/19/2017 10:48 PM, Vikas RK wrote:
> The Beam model doesn't support global sorting, [1] discusses in detail 
> that you might find useful.
>
> [1] 
> https://lists.apache.org/thread.html/bc0e65a3bb653b8fd0db96bcd4c9da5af71a71af5a5639a472167808@1464278191@%3Cdev.beam.apache.org%3E
>
> On 19 July 2017 at 02:45, Jan Lukavsk√Ĺ <je.ik@seznam.cz 
> <mailto:je.ik@seznam.cz>> wrote:
>
>     Hi all,
>
>     I'm trying to get better understanding of Beam's internals for the
>     sake of integration with Euphoria API as a DSL ([1]), and while
>     trying to wrap Euphoria's abstractions of outputs, I came across a
>     little issue, that I'm currently a little stuck with. The issue is
>     not important to this question, but it basically boils down to the
>     following: how could I write a Pipeline, that works like a
>     terasort benchmark ([2]). That is - I have a randomly distributed
>     dataset (let's suppose batch case for simplicity), and I want to
>     sort it so that on output I will have N totally sorted partitions.
>     This implies that I can somehow compare the partitions (or
>     partition IDs) on output, so that the following holds: For each
>     partitions X and Y, if partition X is less to partition Y, then
>     all elements in partition X are less or equal to all elements in
>     partition Y.
>
>     So far, I have not been able to find a clean solution in Beam. I
>     can do a group-by-key operation (where the *key* would be
>     partition Id), and then sort the data within the key. But I have
>     issues outputting the sorted data by a ParDo (because it can run
>     in parallel in theory, and therefore I can either loose the
>     sorting, or run to concurrency issues).
>
>     Would anyone have an idea about how to do this?
>
>     Thanks for any comments,
>
>      Jan
>
>     [1] https://github.com/seznam/euphoria
>     <https://github.com/seznam/euphoria>
>
>     [2]
>     https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html
>     <https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html>
>
>


Mime
View raw message