beam-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jan Lukavsk√Ĺ <>
Subject Re: Terasort-like pipeline
Date Fri, 21 Jul 2017 14:31:42 GMT

thanks for answer. I understand that Beam does not want to incorporate 
in the model a way to handle parallelism (because it is left to the 
runner to decide, which I find good). But there are some use-cases where 
it would be beneficial to force *sequential* processing. That is to make 
sure that certain PCollection (or, to state it exactly, each window of a 
PCollection) is processed entirely by a single (fault tolerant) 
instance. The terasort pipeline would then be realizable and I don't 
think that even affects the runners so much. Many of them (actuall all I 
know :)) nevertheless have this option to process a "partition" by a 
single "mapper" or "processor".

Would it be possible to add a sequential form of ParDo into the model? 
Or is it strictly against the philosophy?


On 07/19/2017 10:48 PM, Vikas RK wrote:
> The Beam model doesn't support global sorting, [1] discusses in detail 
> that you might find useful.
> [1] 
> On 19 July 2017 at 02:45, Jan Lukavsk√Ĺ < 
> <>> wrote:
>     Hi all,
>     I'm trying to get better understanding of Beam's internals for the
>     sake of integration with Euphoria API as a DSL ([1]), and while
>     trying to wrap Euphoria's abstractions of outputs, I came across a
>     little issue, that I'm currently a little stuck with. The issue is
>     not important to this question, but it basically boils down to the
>     following: how could I write a Pipeline, that works like a
>     terasort benchmark ([2]). That is - I have a randomly distributed
>     dataset (let's suppose batch case for simplicity), and I want to
>     sort it so that on output I will have N totally sorted partitions.
>     This implies that I can somehow compare the partitions (or
>     partition IDs) on output, so that the following holds: For each
>     partitions X and Y, if partition X is less to partition Y, then
>     all elements in partition X are less or equal to all elements in
>     partition Y.
>     So far, I have not been able to find a clean solution in Beam. I
>     can do a group-by-key operation (where the *key* would be
>     partition Id), and then sort the data within the key. But I have
>     issues outputting the sorted data by a ParDo (because it can run
>     in parallel in theory, and therefore I can either loose the
>     sorting, or run to concurrency issues).
>     Would anyone have an idea about how to do this?
>     Thanks for any comments,
>      Jan
>     [1]
>     <>
>     [2]
>     <>

View raw message