flink-user mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: Variable Tuple Type
Date Tue, 06 Dec 2016 19:52:48 GMT
Hi Max,

Tuples in Flink are of fixed length. You can define your own data types and
serializers, but this is not the easiest solution.

I would go for Array types, especially if your data can be primitive types
(long).
The serializer for primitive arrays should be almost as efficient as the
Tuple serializers. The only overhead is serializing the length of the array,
which is 4 bytes.
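
A minimal sketch of the join step from your example, (a,b,c) x (c,d,e) ->
(a,b,c,d,e), on `long[]` rows. It uses plain Java only. The class and method
names are made up for illustration, and it assumes the join key is the last
element of the left row and the first element of the right row. In a Flink job
the method body would sit inside your join function.

```java
import java.util.Arrays;

final class LongArrayJoin {

    // Concatenates two rows that share a join key: the last element of `left`
    // must equal the first element of `right`, and that key is emitted once.
    // Hypothetical helper for illustration, not part of the Flink API.
    static long[] joinOnSharedKey(long[] left, long[] right) {
        if (left.length == 0 || right.length == 0) {
            throw new IllegalArgumentException("both rows must be non-empty");
        }
        if (left[left.length - 1] != right[0]) {
            throw new IllegalArgumentException("rows do not share a join key");
        }
        // Copy `left` into a new array sized for both rows minus the shared key.
        long[] out = Arrays.copyOf(left, left.length + right.length - 1);
        // Append `right` without its first element (the shared key).
        System.arraycopy(right, 1, out, left.length, right.length - 1);
        return out;
    }

    public static void main(String[] args) {
        long[] joined = joinOnSharedKey(new long[] {1, 2, 3}, new long[] {3, 4, 5});
        System.out.println(Arrays.toString(joined)); // prints [1, 2, 3, 4, 5]
    }
}
```

The row length is not fixed anywhere, so the result can keep growing across
successive joins without hitting a limit like the one on the Tuple classes.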

Best, Fabian


2016-12-05 17:28 GMT+01:00 Max Kießling <mailing@kopfueber.org>:

> Hey,
>
> for a project we need to represent data as lists. So each entry in the
> DataSets basically holds a list of basic data type elements. When
> processing the data we keep joining lists of the same shape and so the
> list size grows over time.
>
> e.g. (a,b,c) x (c,d,e) -> (a,b,c,d,e)
>
> Currently our solution basically is
> to use a `DataSet<List<Long>>`.
>
> The problem with this is that the performance seems to be quite poor
> compared to using tuples. When we compare the same job using either
> Tuples or Lists, Tuples seem to be 2-10 times faster.
>
> However since we can't know in advance how many elements the list will
> have for a given job, using the Tuple0-25 would be both cumbersome and
> complex, especially if the list size outgrows 25.
> Using the Record class, the results look promising, but using Tuples is
> still twice as fast.
>
> In general our tests yield that the runtime is
> Tuple < Array < Record < List
>
> So my question is, do you see a possible way to create a variable length
> tuple type which can grow almost indefinitely while keeping most of the
> benefits of the TupleXX classes but skipping lots of the overhead of
> Record (like keeping track of possible null values etc.)?
>
> Thanks a lot
> Best Max
>
