beam-user mailing list archives

From Lukasz Cwik <lc...@google.com>
Subject Re: Beam and Python API: Pandas/Numpy?
Date Tue, 03 Oct 2017 16:46:46 GMT
This is neat.

On Fri, Sep 29, 2017 at 1:26 PM, Vilhelm von Ehrenheim <vonehrenheim@gmail.com> wrote:

> Hi Steve!
> I have several pipelines that successfully use both numpy and scikit
> models without any problems. I don't think I use Pandas atm but I'm sure
> that is fine too.
>
> However, you might have to do some special stuff if you encounter
> serializability problems. I also have tensorflow models in use, which were
> a bit trickier to get to work because of serialization problems as you
> mention. For that I needed to load one model instance per thread using
> threading.local as is done here:
>
> https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/impl.py
>
> (I realize that this file has evolved a bit since I last looked at it.
> Might be worth looking at an older version of the file as it's quite
> advanced now.)
>
> So, when serializability is not possible, you can still initialize objects
> locally in threads and let bundles that are executed in the same thread use
> the locally instantiated objects instead of sharing one instance across
> all bundles and threads.
>
> Br,
> Vilhelm
>
> On 29 Sep 2017 17:17, "Steven DeLaurentis" <timeisapear@gmail.com> wrote:
>
> Hi everyone,
>
> Came across this interesting project recently. Read through some of the
> docs and still had a question: is it possible to use NumPy/Pandas in the
> DoFn of a Beam pipeline? Or does the requirement of a serializable function
> preclude this possibility?
>
> Thanks,
> Steve
>
>
>
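
The per-thread initialization Vilhelm describes can be sketched roughly as below. This is not the tensorflow_transform code, only a minimal illustration using the standard library: ExpensiveModel, get_model and PredictFn are hypothetical names, and PredictFn only mimics the shape of a DoFn (in a real pipeline it would presumably subclass apache_beam.DoFn). The model holds a lock, which cannot be pickled, so it is created lazily inside each worker thread instead of being shipped with the function.

```python
import threading


class ExpensiveModel:
    """Hypothetical stand-in for a model that cannot be serialized."""

    def __init__(self):
        # threading.Lock objects cannot be pickled, so neither can this model.
        self._lock = threading.Lock()

    def predict(self, x):
        with self._lock:
            return x * 2


# One storage slot per thread; attributes set here are invisible to other threads.
_thread_state = threading.local()


def get_model():
    """Return this thread's model, building it on first use."""
    if not hasattr(_thread_state, "model"):
        _thread_state.model = ExpensiveModel()
    return _thread_state.model


class PredictFn:
    """Shaped like a DoFn; it carries no model, so it has no unpicklable state."""

    def process(self, element):
        # Every call made on the same thread reuses the same model instance.
        yield get_model().predict(element)
```

Calling list(PredictFn().process(3)) on any thread builds that thread's model on the first call and reuses it afterwards, so the model itself never has to be serialized.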
