beam-user mailing list archives

From Dmitry Demeshchuk <dmi...@postmates.com>
Subject Re: Combining multiple DoFn's into one
Date Thu, 01 Jun 2017 21:56:05 GMT
Haha, thanks, Sourabh, you beat me to it :)
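
A minimal sketch of the pattern discussed in the quoted thread below, written in plain Python rather than Beam so that it runs without the SDK. It models the two flat-map stages (SQL query -> S3 files, S3 files -> rows) and hides them behind one entry point. Everything here is hypothetical and for illustration only: FAKE_S3, unload_to_s3, read_s3_file, flat_map and read_from_redshift are made-up names, and no Redshift or S3 call is made. In Beam itself the same chaining would presumably live inside the expand method of a PTransform subclass, as the GroupByKey link suggests.

```python
from itertools import chain

# Hypothetical stand-in for S3: file key -> lines of unloaded data.
# A real pipeline would fetch these objects from S3 after UNLOAD finishes.
FAKE_S3 = {
    "unload/part_0000": ["1,alice", "2,bob"],
    "unload/part_0001": ["3,carol"],
}


def unload_to_s3(query):
    """Stage 1: SQL query -> list of S3 file keys (stubbed)."""
    # A real implementation would run UNLOAD over the Postgres protocol;
    # this stub ignores the query and lists the fake files.
    return sorted(FAKE_S3)


def read_s3_file(key):
    """Stage 2: one S3 file key -> rows of data."""
    for line in FAKE_S3[key]:
        yield tuple(line.split(","))


def flat_map(fn, items):
    """Apply fn to every item and flatten the results into one stream."""
    return chain.from_iterable(fn(item) for item in items)


def read_from_redshift(query):
    """Single entry point that hides both stages from the caller."""
    # In Beam this would be roughly:
    #   pcoll | beam.FlatMap(unload_to_s3) | beam.FlatMap(read_s3_file)
    files = flat_map(unload_to_s3, [query])
    return flat_map(read_s3_file, files)


rows = list(read_from_redshift("SELECT id, name FROM users"))
```

The caller only sees read_from_redshift; the intermediate list of files never leaks out, which is the encapsulation the original question asks for.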

On Thu, Jun 1, 2017 at 2:55 PM, Dmitry Demeshchuk <dmitry@postmates.com>
wrote:

> Looks like the expand method should do the trick, similar to how it's done
> in GroupByKey?
>
> https://github.com/apache/beam/blob/dc4acfdd1bb30a07a9c48849f88a67f60bc8ff08/sdks/python/apache_beam/transforms/core.py#L1104
>
> On Thu, Jun 1, 2017 at 2:37 PM, Dmitry Demeshchuk <dmitry@postmates.com>
> wrote:
>
>> Hi folks,
>>
>> I'm currently playing with the Python SDK, primarily 0.6.0, since 2.0.0 is
>> apparently not supported by Dataflow, but I'm also trying to understand the
>> 2.0.0 API better.
>>
>> I've been trying to find a way of combining two or more DoFn's into a
>> single one, so that one doesn't have to repeat the same pattern over and
>> over again.
>>
>> Specifically, my use case is getting data out of Redshift via the
>> "UNLOAD" command:
>>
>> 1. Connect to Redshift via Postgres protocol and do the unload
>> <http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>.
>> 2. Connect to S3 and fetch the files that Redshift unloaded there,
>> converting them into a PCollection.
>>
>> It's worth noting here that Redshift generates multiple files, usually at
>> least 10 or so; the exact number may depend on the number of cores of the
>> Redshift instance, some settings, etc. Reading these files in parallel
>> sounds like a good idea.
>>
>> So, it feels like this is just a combination of two FlatMaps:
>> 1. SQL query -> list of S3 files
>> 2. List of S3 files -> rows of data
>>
>> I could just create two DoFns for that and make people combine them, but
>> that feels like overkill. Instead, one should be able to just call
>> ReadFromRedshift and not really care about what exactly happens under the
>> hood.
>>
>> Plus, it just feels like the ability to take somewhat complex pieces of
>> the execution graph and encapsulate them into a DoFn would be a nice
>> capability.
>>
>> Are there any officially recommended ways to do that?
>>
>> Thank you.
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>



-- 
Best regards,
Dmitry Demeshchuk.
