flink-dev mailing list archives

From Chesnay Schepler <c.schep...@web.de>
Subject Re: Types in the Python API
Date Fri, 31 Jul 2015 08:30:43 GMT
If it's just a single array, how would you define group/sort keys?

On 31.07.2015 07:03, Aljoscha Krettek wrote:
> I think then the Python part would just serialize all the tuple fields to a
> big byte array. And all the key fields to another array, so that the java
> side can do comparisons on the whole "key blob".
>
> Maybe it's overly simplistic, but it might work. :D
>
> On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schepler@web.de> wrote:
>
>> I can see this working for basic types, but am unsure how it would work
>> with Tuples. Wouldn't the java API still need to know the arity to set up
>> serializers?
>>
>> On 30.07.2015 23:02, Aljoscha Krettek wrote:
>>> I believe it should be possible to create a special PythonTypeInfo where
>>> the python side is responsible for serializing data to a byte array and to
>>> the java side it is just a byte array and all the comparisons are also
>>> performed on these byte arrays. I think partitioning and sort should still
>>> work, since the sorting is (in most cases) only used to group the elements
>>> for a groupBy(). If proper sort order would be required this would have to
>>> be done on the python side.
>>>
>>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schepler@web.de> wrote:
>>>
>>>> To be perfectly honest I never really managed to work my way through
>>>> Spark's python API, it's a whole bunch of magic to me; not even the
>>>> general structure is understandable.
>>>>
>>>> With "pure python" do you mean doing everything in python? as in just
>>>> having serialized data on the java side?
>>>>
>>>> I believe the way to do this with Flink is to add a switch that
>>>> a) disables all type checks
>>>> b) creates serializers dynamically at runtime.
>>>>
>>>> a) should be fairly straightforward, b) on the other hand....
>>>>
>>>> btw., the Python API itself doesn't require the type information, it
>>>> already does the b part.
>>>>
>>>> On 30.07.2015 22:11, Gyula Fóra wrote:
>>>>> That I understand, but could you please tell me how is this done
>>>>> differently in Spark for instance?
>>>>>
>>>>> What would we need to change to make this work with pure python (as it
>>>>> seems to be possible)? This probably has large performance implications
>>>>> though.
>>>>>
>>>>> Gyula
>>>>>
>>>>> Chesnay Schepler <c.schepler@web.de> wrote (Thu, 30 Jul 2015, 22:04):
>>>>>
>>>>>> because it still goes through the Java API that requires some kind of
>>>>>> type information. imagine a java api program where you omit all generic
>>>>>> types, it just wouldn't work as of now.
>>>>>>
>>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
>>>>>>> Hey!
>>>>>>>
>>>>>>> Could anyone briefly tell me what exactly is the reason why we force the
>>>>>>> users in the Python API to declare types for operators?
>>>>>>>
>>>>>>> I don't really understand how this works in different systems but I am
>>>>>>> just curious why Flink has types and why Spark doesn't for instance.
>>>>>>>
>>>>>>> If you give me some pointers to read that would also be fine :)
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Gyula
>>>>>>>
>>
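
The "key blob" idea from this thread can be sketched roughly as follows. This is only an illustration under assumptions: the wire format (a type tag plus a fixed-width or length-prefixed payload) is made up, the names encode_field, to_blobs, and group_by_key_blob are hypothetical and not part of Flink's Python API, and group_by_key_blob merely stands in for what the java side would do when it sees nothing but bytes. The key fields are chosen on the python side by index, which is one possible answer to how group/sort keys could be defined.

```python
import struct
from itertools import groupby


def encode_field(value):
    """Encode one field as a type tag plus payload (illustrative format only)."""
    if isinstance(value, int):
        # 8-byte big-endian signed integer
        return b"i" + struct.pack(">q", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        # 4-byte length prefix followed by the UTF-8 bytes
        return b"s" + struct.pack(">I", len(data)) + data
    raise TypeError("unsupported field type: %r" % type(value))


def to_blobs(record, key_fields):
    """Python side: serialize the key fields and the whole tuple separately."""
    key_blob = b"".join(encode_field(record[i]) for i in key_fields)
    value_blob = b"".join(encode_field(field) for field in record)
    return key_blob, value_blob


def group_by_key_blob(records, key_fields):
    """Stand-in for the java side: sort and group on the raw key bytes only."""
    pairs = sorted(
        (to_blobs(record, key_fields) for record in records),
        key=lambda pair: pair[0],
    )
    return [
        (key, [value for _, value in group])
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    records = [("a", 1), ("b", -1), ("a", 2), ("b", 1)]
    for key, values in group_by_key_blob(records, key_fields=[0]):
        print(key, len(values))
```

Grouping only needs equal keys to produce equal bytes, so it works without the java side knowing the arity or field types. Ordering is a different matter: with this encoding the key blob for 1 sorts before the key blob for -1, which is why a proper sort order would have to be handled on the python side, as noted in the thread.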

