flink-dev mailing list archives

From Gyula Fóra <gyula.f...@gmail.com>
Subject Re: Types in the Python API
Date Fri, 31 Jul 2015 12:01:49 GMT
In any case, thank you guys for the exhaustive discussion :D

Aljoscha Krettek <aljoscha@apache.org> wrote (on Fri, 31 Jul 2015, 13:52):

> Yes, I wouldn't deal with that now, that's orthogonal to the Types issue.
>
> On Fri, 31 Jul 2015 at 12:09 Chesnay Schepler <c.schepler@web.de> wrote:
>
> > I feel like we drifted away from the original topic a bit, but alright.
> >
> > I don't consider it a pity we created a proprietary protocol. We know
> > exactly how it works and what it is capable of. It is also made exactly
> > for our use case, in contrast to general-purpose libraries. If we ever
> > decide that the current implementation is lacking, we can always look for
> > better alternatives and swap stuff out fairly easily. Bonus points for
> > being able to swap out only part of the system, since there is a clear
> > distinction between *what* (how the data is serialized) and
> > *how* (tcp/mmap) data is exchanged, something that, in my opinion, is way
> > too often bundled together.
> >
> > On the other hand, let's assume we went from the start with one of these
> > magic libraries. If you then notice it lacks something (let's say it's too
> > slow) and can't find a different library without these faults, you are
> > so screwed. Now you have to re-implement these magic libraries, with all
> > their supported features, without these faults, or otherwise you break
> > a lot of user programs that built upon these.
> >
> > The current implementation was a safer approach imo. It has its faults,
> > and did provide me with some very nerve-wracking afternoons, but I'd
> > feel really uncomfortable relying on some library that I have no control
> > over for the most performance-impacting component.
> >
> > On 31.07.2015 11:18, Maximilian Michels wrote:
> > > py4j looks really nice and the communication works in both ways. There
> > > is also another Python to Java communication library called javabridge.
> > > I think it is a pity we chose to implement a proprietary protocol for
> > > the network communication of the Python API. This could have been
> > > abstracted more nicely and we have already seen that you can run into
> > > problems if you implement that yourself.
> > >
> > > Serializers could be created dynamically if Python passed its
> > > dynamically determined types to Java at runtime. Then everything should
> > > work on the Java side.
> > >
> > > On Fri, Jul 31, 2015 at 11:01 AM, Till Rohrmann <trohrmann@apache.org>
> > > wrote:
> > >
> > >> Zeppelin uses py4j [1] to transfer data between a Python process and a
> > >> JVM. That way they can run a Python interpreter and Java interpreter
> > >> and easily share state between them. Spark also uses py4j as a bridge
> > >> between Java and Python. However, I don't know for what exactly. And I
> > >> also don't know what's the performance penalty of py4j. But programming
> > >> is a lot of fun with it :-)
> > >>
> > >> Cheers,
> > >> Till
> > >>
> > >> [1] https://www.py4j.org/
> > >>
> > >> On Fri, Jul 31, 2015 at 10:34 AM, Stephan Ewen <sewen@apache.org> wrote:
> > >>
> > >>> I think in short: Spark never worried about types. It is just
> > >>> something arbitrary.
> > >>>
> > >>> Flink worries about types, for memory management.
> > >>>
> > >>> Aljoscha's suggestion is a good one: have a PythonTypeInfo that is
> > >>> dynamic. Till also found a pretty nice way to connect Python and Java
> > >>> in his Zeppelin-based demo at the meetup.
> > >>>
> > >>> On Fri, Jul 31, 2015 at 10:30 AM, Chesnay Schepler <c.schepler@web.de>
> > >>> wrote:
> > >>>
> > >>>> If it's just a single array, how would you define group/sort keys?
> > >>>>
> > >>>>
> > >>>> On 31.07.2015 07:03, Aljoscha Krettek wrote:
> > >>>>
> > >>>>> I think then the Python part would just serialize all the tuple
> > >>>>> fields to a big byte array. And all the key fields to another array,
> > >>>>> so that the java side can do comparisons on the whole "key blob".
> > >>>>>
> > >>>>> Maybe it's overly simplistic, but it might work. :D
> > >>>>>
> > >>>>> On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schepler@web.de>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> I can see this working for basic types, but am unsure how it would
> > >>>>>> work with Tuples. Wouldn't the java API still need to know the arity
> > >>>>>> to set up serializers?
> > >>>>>>
> > >>>>>> On 30.07.2015 23:02, Aljoscha Krettek wrote:
> > >>>>>>
> > >>>>>>> I believe it should be possible to create a special PythonTypeInfo
> > >>>>>>> where the python side is responsible for serializing data to a byte
> > >>>>>>> array and to the java side it is just a byte array and all the
> > >>>>>>> comparisons are also performed on these byte arrays. I think
> > >>>>>>> partitioning and sort should still work, since the sorting is (in
> > >>>>>>> most cases) only used to group the elements for a groupBy(). If
> > >>>>>>> proper sort order would be required this would have to be done on
> > >>>>>>> the python side.
> > >>>>>>>
> > >>>>>>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schepler@web.de>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> To be perfectly honest I never really managed to work my way
> > >>>>>>>> through Spark's python API, it's a whole bunch of magic to me; not
> > >>>>>>>> even the general structure is understandable.
> > >>>>>>>>
> > >>>>>>>> With "pure python" do you mean doing everything in python? As in
> > >>>>>>>> just having serialized data on the java side?
> > >>>>>>>>
> > >>>>>>>> I believe the way to do this with Flink is to add a switch that
> > >>>>>>>> a) disables all type checks
> > >>>>>>>> b) creates serializers dynamically at runtime.
> > >>>>>>>>
> > >>>>>>>> a) should be fairly straightforward, b) on the other hand....
> > >>>>>>>>
> > >>>>>>>> btw., the Python API itself doesn't require the type information,
> > >>>>>>>> it already does the b part.
> > >>>>>>>>
> > >>>>>>>> On 30.07.2015 22:11, Gyula Fóra wrote:
> > >>>>>>>>
> > >>>>>>>>> That I understand, but could you please tell me how this is done
> > >>>>>>>>> differently in Spark, for instance?
> > >>>>>>>>>
> > >>>>>>>>> What would we need to change to make this work with pure python
> > >>>>>>>>> (as it seems to be possible)? This probably has large performance
> > >>>>>>>>> implications though.
> > >>>>>>>>>
> > >>>>>>>>> Gyula
> > >>>>>>>>>
> > >>>>>>>>> Chesnay Schepler <c.schepler@web.de> wrote (on Thu, 30 Jul 2015,
> > >>>>>>>>> 22:04):
> > >>>>>>>>>
> > >>>>>>>>>> Because it still goes through the Java API that requires some
> > >>>>>>>>>> kind of type information. Imagine a java api program where you
> > >>>>>>>>>> omit all generic types, it just wouldn't work as of now.
> > >>>>>>>>>>
> > >>>>>>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hey!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Could anyone briefly tell me what exactly is the reason why we
> > >>>>>>>>>>> force the users in the Python API to declare types for
> > >>>>>>>>>>> operators?
> > >>>>>>>>>>>
> > >>>>>>>>>>> I don't really understand how this works in different systems
> > >>>>>>>>>>> but I am just curious why Flink has types and why Spark doesn't
> > >>>>>>>>>>> for instance.
> > >>>>>>>>>>>
> > >>>>>>>>>>> If you give me some pointers to read that would also be fine :)
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thank you,
> > >>>>>>>>>>> Gyula
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> >
> >
>
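
Editor's illustration (not part of the original messages): the "key blob" idea discussed in the thread, where the Python side serializes every record to one byte array and its key fields to a second one, and the other side groups on the opaque key bytes only, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names (encode_field, to_blobs, group_by_key_blob), the tag-plus-payload encoding, and the use of plain Python to stand in for the Java side. It is not the Flink Python API or its actual wire format.

```python
import struct
from itertools import groupby

# Hypothetical encoding for this sketch only: one type tag byte plus a
# fixed, deterministic payload, so equal values always give equal bytes.
def encode_field(value):
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, int):
        return b"i" + struct.pack(">q", value)
    if isinstance(value, float):
        return b"f" + struct.pack(">d", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return b"s" + struct.pack(">I", len(data)) + data
    raise TypeError("unsupported field type: %r" % type(value))

def decode_fields(blob):
    """Inverse of encode_field applied to a concatenation of fields."""
    fields, pos = [], 0
    while pos < len(blob):
        tag = blob[pos:pos + 1]
        pos += 1
        if tag == b"i":
            fields.append(struct.unpack_from(">q", blob, pos)[0])
            pos += 8
        elif tag == b"f":
            fields.append(struct.unpack_from(">d", blob, pos)[0])
            pos += 8
        elif tag == b"s":
            (length,) = struct.unpack_from(">I", blob, pos)
            pos += 4
            fields.append(blob[pos:pos + length].decode("utf-8"))
            pos += length
        else:
            raise ValueError("unknown tag: %r" % tag)
    return tuple(fields)

def to_blobs(record, key_fields):
    """Python side: return (key_blob, value_blob) for one tuple."""
    key_blob = b"".join(encode_field(record[i]) for i in key_fields)
    value_blob = b"".join(encode_field(f) for f in record)
    return key_blob, value_blob

def group_by_key_blob(pairs):
    """Stand-in for the Java side: it never decodes anything and only
    sorts and compares the raw key bytes to bring equal keys together."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=lambda pair: pair[0])]

records = [("a", 1, 2.0), ("b", 2, 3.0), ("a", 3, 4.0)]
pairs = [to_blobs(record, key_fields=(0,)) for record in records]
groups = group_by_key_blob(pairs)

# Back on the Python side the blobs are decoded again.
decoded = {decode_fields(key): [decode_fields(v) for v in values]
           for key, values in groups}
```

Grouping works here because equal keys produce identical bytes. The byte order is not the natural order of the values, though (with this encoding a negative integer compares greater than a positive one), which matches the caveat in the thread that a proper sort order would have to be established on the Python side.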
