flink-dev mailing list archives

From Aljoscha Krettek <aljos...@apache.org>
Subject Re: Types in the Python API
Date Fri, 31 Jul 2015 08:32:55 GMT
I don't know yet. :D

Maybe the sorting will have to be delegated to Python. I don't think it's
possible to always get a meaningful order when only sorting on the
serialized bytes. It should, however, work for grouping.
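
A minimal sketch of the point above, not taken from the Flink code base: it
assumes a hypothetical key encoding (each key field packed as a 32-bit
little-endian integer) and a made-up helper `key_blob`. Comparing the raw
bytes puts equal keys next to each other, which is all grouping needs, but
the resulting order is not the numeric order.

```python
import struct
from itertools import groupby


def key_blob(record, key_fields):
    # Hypothetical encoding: every key field is a 32-bit little-endian int.
    # Equal keys always produce identical bytes, which grouping relies on.
    return b"".join(struct.pack("<i", record[i]) for i in key_fields)


records = [(256, "a"), (1, "b"), (2, "c"), (1, "d")]

# Pair each record with its serialized key, then sort on the bytes only,
# the way a side that sees nothing but byte arrays would have to.
blobs = [(key_blob(r, [0]), r) for r in records]
blobs.sort(key=lambda pair: pair[0])

# Key values in byte-wise order: 256 comes first because its lowest byte is 0.
byte_order = [r[0] for _, r in blobs]

# Grouping still works, because identical keys are adjacent after the sort.
groups = {k: [r for _, r in g] for k, g in groupby(blobs, key=lambda pair: pair[0])}

print(byte_order)                    # [256, 1, 1, 2]
print(groups[struct.pack("<i", 1)])  # [(1, 'b'), (1, 'd')]
```

This only holds if the serialization is canonical, i.e. equal values always
serialize to the same bytes. A meaningful sort order would need either an
order-preserving encoding or a comparison done on the Python side.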

On Fri, 31 Jul 2015 at 10:31 Chesnay Schepler <c.schepler@web.de> wrote:

> If it's just a single array, how would you define group/sort keys?
>
> On 31.07.2015 07:03, Aljoscha Krettek wrote:
> > I think then the Python part would just serialize all the tuple fields to
> > a big byte array. And all the key fields to another array, so that the Java
> > side can do comparisons on the whole "key blob".
> >
> > Maybe it's overly simplistic, but it might work. :D
> >
> > On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schepler@web.de> wrote:
> >
> >> I can see this working for basic types, but am unsure how it would work
> >> with Tuples. Wouldn't the Java API still need to know the arity to set up
> >> serializers?
> >>
> >> On 30.07.2015 23:02, Aljoscha Krettek wrote:
> >>> I believe it should be possible to create a special PythonTypeInfo where
> >>> the Python side is responsible for serializing data to a byte array, and
> >>> to the Java side it is just a byte array, and all the comparisons are also
> >>> performed on these byte arrays. I think partitioning and sort should still
> >>> work, since the sorting is (in most cases) only used to group the elements
> >>> for a groupBy(). If a proper sort order is required, this would have to
> >>> be done on the Python side.
> >>>
> >>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schepler@web.de> wrote:
> >>>
> >>>> To be perfectly honest, I never really managed to work my way through
> >>>> Spark's Python API; it's a whole bunch of magic to me, and not even the
> >>>> general structure is understandable.
> >>>>
> >>>> With "pure Python", do you mean doing everything in Python? As in just
> >>>> having serialized data on the Java side?
> >>>>
> >>>> I believe the way to do this with Flink is to add a switch that
> >>>> a) disables all type checks
> >>>> b) creates serializers dynamically at runtime.
> >>>>
> >>>> a) should be fairly straightforward, b) on the other hand....
> >>>>
> >>>> Btw., the Python API itself doesn't require the type information; it
> >>>> already does the b) part.
> >>>>
> >>>> On 30.07.2015 22:11, Gyula Fóra wrote:
> >>>>> That I understand, but could you please tell me how this is done
> >>>>> differently in Spark, for instance?
> >>>>>
> >>>>> What would we need to change to make this work with pure Python (as it
> >>>>> seems to be possible)? This probably has large performance implications
> >>>>> though.
> >>>>>
> >>>>> Gyula
> >>>>>
> >>>>> Chesnay Schepler <c.schepler@web.de> wrote (on Thu, 30 Jul 2015, 22:04):
> >>>>>
> >>>>>> Because it still goes through the Java API, which requires some kind of
> >>>>>> type information. Imagine a Java API program where you omit all generic
> >>>>>> types; it just wouldn't work as of now.
> >>>>>>
> >>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
> >>>>>>> Hey!
> >>>>>>>
> >>>>>>> Could anyone briefly tell me what exactly is the reason why we force
> >>>>>>> the users in the Python API to declare types for operators?
> >>>>>>>
> >>>>>>> I don't really understand how this works in different systems, but I
> >>>>>>> am just curious why Flink has types and why Spark doesn't, for instance.
> >>>>>>>
> >>>>>>> If you give me some pointers to read, that would also be fine :)
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>> Gyula
> >>>>>>>
