beam-user mailing list archives

From Chamikara Jayalath <chamik...@apache.org>
Subject Re: Combining multiple DoFn's into one
Date Thu, 01 Jun 2017 21:59:44 GMT
On Thu, Jun 1, 2017 at 2:56 PM Dmitry Demeshchuk <dmitry@postmates.com>
wrote:

> Haha, thanks, Sourabh, you beat me to it :)
>
> On Thu, Jun 1, 2017 at 2:55 PM, Dmitry Demeshchuk <dmitry@postmates.com>
> wrote:
>
>> Looks like the expand method should do the trick, similar to how it's
>> done in GroupByKey?
>>
>>
>> https://github.com/apache/beam/blob/dc4acfdd1bb30a07a9c48849f88a67f60bc8ff08/sdks/python/apache_beam/transforms/core.py#L1104
>>
>> On Thu, Jun 1, 2017 at 2:37 PM, Dmitry Demeshchuk <dmitry@postmates.com>
>> wrote:
>>
>>> Hi folks,
>>>
>>> I'm currently playing with the Python SDK, primarily 0.6.0, since 2.0.0
>>> is apparently not supported by Dataflow, but I'm trying to understand the
>>> 2.0.0 API better too.
>>>
>>>
I think Dataflow supports the 2.0.0 release. Did you find some documentation
that says otherwise?

- Cham


>>> I've been trying to find a way of combining two or more DoFn's into a
>>> single one, so that one doesn't have to repeat the same pattern over and
>>> over again.
>>>
>>> Specifically, my use case is getting data out of Redshift via the
>>> "UNLOAD" command:
>>>
>>> 1. Connect to Redshift via Postgres protocol and do the unload
>>> <http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>.
>>> 2. Connect to S3 and fetch the files that Redshift unloaded there,
>>> converting them into a PCollection.
>>>
>>> It's worth noting here that Redshift generates multiple files, usually
>>> at least 10 or so. The exact number may depend on the number of cores of
>>> the Redshift instance, some settings, etc. Reading these files in parallel
>>> sounds like a good idea.
>>>
>>> So, it feels like this is just a combination of two FlatMaps:
>>> 1. SQL query -> list of S3 files
>>> 2. List of S3 files -> rows of data
>>>
>>> I could just create two DoFns for that and make people combine them, but
>>> that feels like overkill. Instead, one should just call ReadFromRedshift
>>> and not really care about what exactly happens under the hood.
>>>
>>> Plus, it just feels like the ability to take somewhat complex pieces
>>> of the execution graph and encapsulate them into a DoFn would be a nice
>>> capability.
>>>
>>> Are there any officially recommended ways to do that?
>>>
>>> Thank you.
>>>
>>> --
>>> Best regards,
>>> Dmitry Demeshchuk.
>>>
>
