beam-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lukasz Cwik <lc...@google.com>
Subject Re: Is this a valid usecase for Apache Beam and Google dataflow
Date Tue, 20 Jun 2017 16:54:12 GMT
Take a look at session windows[1]. As long as the messages you post to
Pubsub aren't spaced out farther then the session gap duration they will
all get grouped together.
It seems as though it would be much simpler to just run a separate Apache
Beam job for each internal job you want to process since you won't have to
deal with potentially late data exceeding the session gap duration.

1: https://cloud.google.com/dataflow/model/windowing#session-windows

On Tue, Jun 20, 2017 at 9:10 AM, Randal Moore <rdmoore79@gmail.com> wrote:

> Just starting looking at Beam this week as a candidate for executing some
> fairly CPU intensive work.  I am curious if the stream-oriented features of
> Beam are a match for my usecase. My user will submit a large number of
> computations to the system (as a "job").  Those computations can be
> expressed in a series of DoFn whose results can be stored (in for example
> google datastore).
>
> My initial idea was to post the individual messages (anywhere from 1000 to
> 1,000,000 per job) to a google pub/sub topic that is watched by a google
> dataflow job. I can write a prototype that performs the appropriate
> transformation and posts the results.
>
> But...
>
> I cannot find any way to capture the notion of the completion of the
> original "user job".  I want to post to pub/sub that all individual
> calculations are complete. I was hoping that I could write a CombineFn that
> would be able to post a message as each job finishes but Combine needs a
> window and I don't see how to define it.
>
> Is there a way to define a window that is defined by the user's job - I
> know exactly which individual computations are part of the user's job (and
> exactly how many).  But all the grouping that I've discovered so far seems
> to be well defined at compile time (e.g., number of messages in the window,
> or number of partitions).
>
> Is this the wrong usecase for dataflow/beam?  Is there a better way to
> express this problem?
>
> Thanks,
> rdm
>

Mime
View raw message