flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gábor Horváth <xazax....@gmail.com>
Subject Re: GSoC Project Proposal Draft: Code Generation in Serializers
Date Sun, 29 May 2016 09:59:11 GMT
Hi!

I would like to give you some status updates on the Google Summer of Code
project. I started to implement the proposed features [1].

Status of code generation in general:
* I can compile the generated code using Janino compiler
* I can load the compiled classes and use them
* For some mysterious reason, during deserializing a Janino compiled
object, the readObject method is not invoked. When the same code is
compiled using another compiler it works as intended. I am investigating
this issue. In case any of you have some idea what the problem might be,
don't keep it secret :) While I am trying to solve this issue, I also
continue to work on code generation. I can still test the generated code,
registering it manually.

Status of generated POJO serializers:
* I could use the generated code on the WordCountPojo example.
* Everything is implemented except for copying stateful serializers and
serializing subclasses.
* There are several possible performance advantages of the generated
serializers:
  - The serialization/deserialization of the fields are not in a loop,
giving the JVM better chance to inline and devirtualize
  - Null checks are eliminated for primitive types
  - Subclass checks are eliminated for final classes

Status of generated POJO comparators:
* I started to implement them.

I did some preliminary benchmarks with the generated code using the
WordCountPojo example.
* In the baseline (using the default Flink serializers)
PojoSerializer.deserialize was one of the hottest methods (with over 11
percent sample rate).
* Using the generated serializers, the percentage of samples from
deserialize method went down below 3 percent.
* Very significant amount of the time is spent in comparators, so there are
some potential performance gains there as well.

What's next? I am trying to solve the problem with the readObject method,
and in the meantime I try to get the generated comparators working on the
WordCountPojo example. Once that is done, I will make a more detailed
performance case study. After that I will add support for handling
subclasses in the generated code. At that point the generated code will
have all the required features.

Note: I did not change the serialization format, so the generated code can
work with the default serializers. This is crucial for backward
compatibility with save points.

Regards,
Gábor

[1] https://github.com/Xazax-hun/flink/commits/serializer_codegen

On 23 April 2016 at 10:33, Gábor Horváth <xazax.hun@gmail.com> wrote:

> Hi,
>
> The GSoC project proposal was accepted! Thank you for all your support. I
> will do my best to live up to the challenges and deliver everything that
> way planned for this summer.
>
> Best Regards,
> Gábor
>
> On 20 April 2016 at 16:18, Gábor Horváth <xazax.hun@gmail.com> wrote:
>
>> On the second thought I think you are right. I had the impression that
>> there is cyclic dependency between TypeInformation and the serializers but
>> that is not the case. So there is no rewrite needed for TypeInformation in
>> order to be able to use Scala for serializers.
>>
>> According to the proposal unless someone utilize the annotations the
>> generated serializers would be compatible to the current ones. There could
>> be a configuration option whether to try to make the layout more compact
>> based on annotations.
>>
>> On 20 April 2016 at 16:03, Fabian Hueske <fhueske@gmail.com> wrote:
>>
>>> Why would you need to rewrite the TypeInformation in Scala?
>>> I think we need a way to replace Serializer implementations anyway unless
>>> the generated serializers are compatible to the current ones.
>>>
>>> 2016-04-20 15:53 GMT+02:00 Gábor Horváth <xazax.hun@gmail.com>:
>>>
>>> > Hi Fabian,
>>> >
>>> > I agree that it would be awesome to move this to its own module/plugin.
>>> > However in order to be able to write the code generation in Scala I
>>> would
>>> > need to rewrite the type information to use Scala as well. I think I
>>> will
>>> > not
>>> > have time to do this during the summer, so I think I will stick to
>>> Java and
>>> > this modularization can be done later.
>>> >
>>> > Thanks,
>>> > Gábor
>>> >
>>> > On 19 April 2016 at 11:50, Fabian Hueske <fhueske@gmail.com> wrote:
>>> >
>>> > > Hi Gabor,
>>> > >
>>> > > you are right, a codegen serializer module would depend on
>>> flink-core and
>>> > > in the current design flink-core would need to know about the type
>>> infos
>>> > /
>>> > > serializers / comparators.
>>> > >
>>> > > Decoupling implementations of type info, serializers, and comparators
>>> > from
>>> > > flink-core and resolving the cyclic dependency would be what the
>>> plugin
>>> > > architecture would be for.
>>> > > Maybe this can be done by some mechanism to dynamically load
>>> > > TypeInformations for types with overridden serializers / comparators.
>>> > > This would require some design document and discussion in the
>>> community.
>>> > >
>>> > > Cheers, Fabian
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > 2016-04-18 21:19 GMT+02:00 Gábor Horváth <xazax.hun@gmail.com>:
>>> > >
>>> > > > Unfortunately making code generation a separate module would
>>> introduce
>>> > > > cyclic dependency.
>>> > > > Code generation requires the TypeInfo which is available in
>>> flink-core
>>> > > and
>>> > > > flink-core requires
>>> > > > the generated serializers from the code generation module. Do you
>>> have
>>> > a
>>> > > > solution for this?
>>> > > >
>>> > > > I think if we can come up with a solution I will implement it as a
>>> > > separate
>>> > > > Scala module
>>> > > > otherwise I will stick to Java.
>>> > > >
>>> > > > BR,
>>> > > > Gábor
>>> > > >
>>> > > > On 18 April 2016 at 12:40, Fabian Hueske <fhueske@gmail.com>
>>> wrote:
>>> > > >
>>> > > > > +1 for not mixing Java and Scala in flink-core.
>>> > > > >
>>> > > > > Maybe it makes sense to implement the code generated serializers
>>> /
>>> > > > > comparators as a separate module which can be plugged-in. This
>>> could
>>> > be
>>> > > > > pure Scala.
>>> > > > > In general, I think it would be good to have some kind of
>>> "version
>>> > > > > management" for serializers in place. With features such as
>>> > safepoints
>>> > > > that
>>> > > > > depend on the implementation of serializers, it would be good to
>>> > have a
>>> > > > > mechanism to switch between implementations.
>>> > > > >
>>> > > > > Best, Fabian
>>> > > > >
>>> > > > > 2016-04-18 10:01 GMT+02:00 Chiwan Park <chiwanpark@apache.org>:
>>> > > > >
>>> > > > > > Yes, I know Janino is a pure Java project. I meant if we add
>>> Scala
>>> > > code
>>> > > > > to
>>> > > > > > flink-core, we should add Scala dependency to flink-core and it
>>> > could
>>> > > > be
>>> > > > > > confusing.
>>> > > > > >
>>> > > > > > Regards,
>>> > > > > > Chiwan Park
>>> > > > > >
>>> > > > > > > On Apr 18, 2016, at 2:49 PM, Márton Balassi <
>>> > > > balassi.marton@gmail.com>
>>> > > > > > wrote:
>>> > > > > > >
>>> > > > > > > Chiwan, just to clarify Janino is a Java project. [1]
>>> > > > > > >
>>> > > > > > > [1] https://github.com/aunkrig/janino
>>> > > > > > >
>>> > > > > > > On Mon, Apr 18, 2016 at 3:40 AM, Chiwan Park <
>>> > > chiwanpark@apache.org>
>>> > > > > > wrote:
>>> > > > > > >
>>> > > > > > >> I prefer to avoid Scala dependencies in flink-core. If
>>> > flink-core
>>> > > > > > includes
>>> > > > > > >> Scala dependencies, Scala version suffix (_2.10 or _2.11)
>>> should
>>> > > be
>>> > > > > > added.
>>> > > > > > >> I think that users could be confused.
>>> > > > > > >>
>>> > > > > > >> Regards,
>>> > > > > > >> Chiwan Park
>>> > > > > > >>
>>> > > > > > >>> On Apr 17, 2016, at 3:49 PM, Márton Balassi <
>>> > > > > balassi.marton@gmail.com>
>>> > > > > > >> wrote:
>>> > > > > > >>>
>>> > > > > > >>> Hi Gábor,
>>> > > > > > >>>
>>> > > > > > >>> I think that adding the Janino dep to flink-core should be
>>> > fine,
>>> > > as
>>> > > > > it
>>> > > > > > >> has
>>> > > > > > >>> quite slim dependencies [1,2] which are generally
>>> orthogonal to
>>> > > > > Flink's
>>> > > > > > >>> main dependency line (also it is already used elsewhere).
>>> > > > > > >>>
>>> > > > > > >>> As for mixing Scala code that is used from the Java parts
>>> of
>>> > the
>>> > > > same
>>> > > > > > >> maven
>>> > > > > > >>> module I am skeptical. We have seen IDE compilation issues
>>> with
>>> > > > > > projects
>>> > > > > > >>> using this setup and have decided that the community-wide
>>> > > potential
>>> > > > > IDE
>>> > > > > > >>> setup pain outweighs the individual implementation
>>> convenience
>>> > > with
>>> > > > > > >> Scala.
>>> > > > > > >>>
>>> > > > > > >>> [1]
>>> > > > > > >>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
>>> > > > > > >>> [2]
>>> > > > > > >>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom
>>> > > > > > >>>
>>> > > > > > >>> On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <
>>> > > > xazax.hun@gmail.com>
>>> > > > > > >> wrote:
>>> > > > > > >>>
>>> > > > > > >>>> Hi!
>>> > > > > > >>>>
>>> > > > > > >>>> Table API already uses code generation and the Janino
>>> compiler
>>> > > > [1].
>>> > > > > Is
>>> > > > > > >> it a
>>> > > > > > >>>> dependency that is ok to add to flink-core? In case it is
>>> ok,
>>> > I
>>> > > > > think
>>> > > > > > I
>>> > > > > > >>>> will use the same in order to be consistent with the other
>>> > code
>>> > > > > > >> generation
>>> > > > > > >>>> efforts.
>>> > > > > > >>>>
>>> > > > > > >>>> I started to look at the Table API code generation [2]
>>> and it
>>> > > uses
>>> > > > > > Scala
>>> > > > > > >>>> extensively. There are several Scala features that can
>>> make
>>> > Java
>>> > > > > code
>>> > > > > > >>>> generation easier such as pattern matching and string
>>> > > > > interpolation. I
>>> > > > > > >> did
>>> > > > > > >>>> not see any Scala code in flink-core yet. Is it ok to
>>> > implement
>>> > > > the
>>> > > > > > code
>>> > > > > > >>>> generation inside the flink-core using Scala?
>>> > > > > > >>>>
>>> > > > > > >>>> Regards,
>>> > > > > > >>>> Gábor
>>> > > > > > >>>>
>>> > > > > > >>>> [1] http://unkrig.de/w/Janino
>>> > > > > > >>>> [2]
>>> > > > > > >>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
>>> > > > > > >>>>
>>> > > > > > >>>> On 18 March 2016 at 19:37, Gábor Horváth <
>>> xazax.hun@gmail.com
>>> > >
>>> > > > > wrote:
>>> > > > > > >>>>
>>> > > > > > >>>>> Thank you! I finalized the project.
>>> > > > > > >>>>>
>>> > > > > > >>>>>
>>> > > > > > >>>>> On 18 March 2016 at 10:29, Márton Balassi <
>>> > > > > balassi.marton@gmail.com>
>>> > > > > > >>>>> wrote:
>>> > > > > > >>>>>
>>> > > > > > >>>>>> Thanks Gábor, now I also see it on the internal GSoC
>>> > > interface.
>>> > > > I
>>> > > > > > have
>>> > > > > > >>>>>> indicated that I wish to mentor your project, I think
>>> you
>>> > can
>>> > > > hit
>>> > > > > > >>>> finalize
>>> > > > > > >>>>>> on your project there.
>>> > > > > > >>>>>>
>>> > > > > > >>>>>> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <
>>> > > > > > xazax.hun@gmail.com>
>>> > > > > > >>>>>> wrote:
>>> > > > > > >>>>>>
>>> > > > > > >>>>>>> Hi,
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>> I have updated this draft to include preliminary
>>> > benchmarks,
>>> > > > > > >> mentioned
>>> > > > > > >>>>>> the
>>> > > > > > >>>>>>> interaction of annotations with savepoints, extended it
>>> > with
>>> > > a
>>> > > > > > >>>> timeline,
>>> > > > > > >>>>>>> and some notes about scala case classes.
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>> Regards,
>>> > > > > > >>>>>>> Gábor
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>> On 9 March 2016 at 16:12, Gábor Horváth <
>>> > xazax.hun@gmail.com
>>> > > >
>>> > > > > > wrote:
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>>> Hi!
>>> > > > > > >>>>>>>>
>>> > > > > > >>>>>>>> As far as I can see the formatting was not correct in
>>> my
>>> > > > > previous
>>> > > > > > >>>>>> mail. A
>>> > > > > > >>>>>>>> better formatted version is available here:
>>> > > > > > >>>>>>>>
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
>>> > > > > > >>>>>>>> Sorry for that.
>>> > > > > > >>>>>>>>
>>> > > > > > >>>>>>>> Regards,
>>> > > > > > >>>>>>>> Gábor
>>> > > > > > >>>>>>>>
>>> > > > > > >>>>>>>> On 9 March 2016 at 15:51, Gábor Horváth <
>>> > > xazax.hun@gmail.com>
>>> > > > > > >>>> wrote:
>>> > > > > > >>>>>>>>
>>> > > > > > >>>>>>>>> Hi,I did not want to send this proposal out before
>>> the I
>>> > > have
>>> > > > > > some
>>> > > > > > >>>>>>>>> initial benchmarks, but this issue was mentioned on
>>> the
>>> > > > mailing
>>> > > > > > >>>> list
>>> > > > > > >>>>>> (
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
>>> > > > > > >>>>>>> ),
>>> > > > > > >>>>>>>>> and I wanted to make this information available to be
>>> > able
>>> > > to
>>> > > > > > >>>>>>> incorporate
>>> > > > > > >>>>>>>>> this into that discussion. I have written this draft
>>> with
>>> > > the
>>> > > > > > help
>>> > > > > > >>>> of
>>> > > > > > >>>>>>> Gábor
>>> > > > > > >>>>>>>>> Gévay and Márton Balassi and I am open to every
>>> > suggestion.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> The proposal draft:
>>> > > > > > >>>>>>>>> Code Generation in Serializers and Comparators of
>>> Apache
>>> > > > Flink
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> I am doing my last semester of my MSc studies and
>>> I’m a
>>> > > > former
>>> > > > > > GSoC
>>> > > > > > >>>>>>>>> student in the LLVM project. I plan to improve the
>>> > > > > serialization
>>> > > > > > >>>>>> code in
>>> > > > > > >>>>>>>>> Flink during this summer. The current implementation
>>> of
>>> > the
>>> > > > > > >>>>>> serializers
>>> > > > > > >>>>>>> can
>>> > > > > > >>>>>>>>> be a performance bottleneck in some scenarios. These
>>> > > > > performance
>>> > > > > > >>>>>>> problems
>>> > > > > > >>>>>>>>> were also reported on the mailing list recently [1].
>>> I
>>> > plan
>>> > > > to
>>> > > > > > >>>>>> implement
>>> > > > > > >>>>>>>>> code generation into the serializers to improve the
>>> > > > performance
>>> > > > > > (as
>>> > > > > > >>>>>>> Stephan
>>> > > > > > >>>>>>>>> Ewen also suggested.)
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> TODO: I plan to include some preliminary benchmarks
>>> in
>>> > this
>>> > > > > > >>>> section.
>>> > > > > > >>>>>>>>> Performance problems with the current serializers
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  1.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  PojoSerializer uses reflection for accessing the
>>> fields,
>>> > > > which
>>> > > > > > >>>> is
>>> > > > > > >>>>>>>>>  slow (eg. [2])
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  This is also a serious problem for the comparators
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  1.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  When deserializing fields of primitive types (eg.
>>> int),
>>> > > the
>>> > > > > > >>>>>> reusing
>>> > > > > > >>>>>>>>>  overload of the corresponding field serializers
>>> cannot
>>> > > > really
>>> > > > > do
>>> > > > > > >>>>>> any
>>> > > > > > >>>>>>> reuse,
>>> > > > > > >>>>>>>>>  because boxed primitive types are immutable in Java.
>>> > This
>>> > > > > > >>>> results
>>> > > > > > >>>>>> in
>>> > > > > > >>>>>>> lots
>>> > > > > > >>>>>>>>>  of object creations. [3][7]
>>> > > > > > >>>>>>>>>  2.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  The loop to call the field serializers makes virtual
>>> > > > function
>>> > > > > > >>>>>> calls,
>>> > > > > > >>>>>>>>>  that cannot be speculatively devirtualized by the
>>> JVM or
>>> > > > > > >>>> predicted
>>> > > > > > >>>>>>> by the
>>> > > > > > >>>>>>>>>  CPU, because different serializer subclasses are
>>> invoked
>>> > > for
>>> > > > > the
>>> > > > > > >>>>>>> different
>>> > > > > > >>>>>>>>>  fields. (And the loop cannot be unrolled, because
>>> the
>>> > > number
>>> > > > > of
>>> > > > > > >>>>>>> iterations
>>> > > > > > >>>>>>>>>  is not a compile time constant.) See also the
>>> following
>>> > > > > > >>>> discussion
>>> > > > > > >>>>>>> on the
>>> > > > > > >>>>>>>>>  mailing list [1].
>>> > > > > > >>>>>>>>>  3.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  A POJO field can have the value null, so the
>>> serializer
>>> > > > > inserts
>>> > > > > > >>>> 1
>>> > > > > > >>>>>>>>>  byte null tags, which wastes space. (Also, the type
>>> > > > extractor
>>> > > > > > >>>>>> logic
>>> > > > > > >>>>>>> does
>>> > > > > > >>>>>>>>>  not distinguish between primitive types and their
>>> boxed
>>> > > > > > >>>> versions,
>>> > > > > > >>>>>> so
>>> > > > > > >>>>>>> even
>>> > > > > > >>>>>>>>>  an int field has a null tag.)
>>> > > > > > >>>>>>>>>  4.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  Subclass tags also add a byte at the beginning of
>>> every
>>> > > POJO
>>> > > > > > >>>>>>>>>  5.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  getLength() does not know the size in most cases [4]
>>> > > > > > >>>>>>>>>  Knowing the size of a type when serialized has
>>> numerous
>>> > > > > > >>>>>> performance
>>> > > > > > >>>>>>>>>  benefits throughout Flink:
>>> > > > > > >>>>>>>>>  1.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     Sorters can do in-place, when the type is small
>>> [5]
>>> > > > > > >>>>>>>>>     2.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     Chaining hash tables do not need resizes, because
>>> > they
>>> > > > know
>>> > > > > > >>>> how
>>> > > > > > >>>>>>>>>     many buckets to allocate upfront [6]
>>> > > > > > >>>>>>>>>     3.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     Different hash table architectures could be
>>> used, eg.
>>> > > > open
>>> > > > > > >>>>>>>>>     addressing with linear probing instead of some
>>> > chaining
>>> > > > > > >>>>>>>>>     4.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     It is possible to deserialize, modify, and then
>>> > > serialize
>>> > > > > > >>>> back
>>> > > > > > >>>>>> a
>>> > > > > > >>>>>>>>>     record to its original place, because it cannot
>>> > happen
>>> > > > that
>>> > > > > > >>>> the
>>> > > > > > >>>>>>> modified
>>> > > > > > >>>>>>>>>     version does not fit in the place allocated
>>> there for
>>> > > the
>>> > > > > old
>>> > > > > > >>>>>>> version (see
>>> > > > > > >>>>>>>>>     CompactingHashTable and ReduceHashTable for
>>> concrete
>>> > > > > > >>>> instances
>>> > > > > > >>>>>> of
>>> > > > > > >>>>>>> this
>>> > > > > > >>>>>>>>>     problem)
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> Note, that 2. and 3. are problems with not just the
>>> > > > > > PojoSerializer,
>>> > > > > > >>>>>> but
>>> > > > > > >>>>>>>>> also with the TupleSerializer.
>>> > > > > > >>>>>>>>> Solution approaches
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  1.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  Run time code generation for every POJO
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     1. and 3 . would be automatically solved, if the
>>> > > > > serializers
>>> > > > > > >>>>>> for
>>> > > > > > >>>>>>>>>     POJOs would be generated on-the-fly (by, for
>>> example,
>>> > > > > > >>>>>> Javassist)
>>> > > > > > >>>>>>>>>     -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     2. also needs code generation, and also some
>>> extra
>>> > > effort
>>> > > > > in
>>> > > > > > >>>>>> the
>>> > > > > > >>>>>>>>>     type extractor to distinguish between primitive
>>> types
>>> > > and
>>> > > > > > >>>> their
>>> > > > > > >>>>>>> boxed
>>> > > > > > >>>>>>>>>     versions
>>> > > > > > >>>>>>>>>     -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     could be used for PojoComparator as well (which
>>> could
>>> > > > > greatly
>>> > > > > > >>>>>>>>>     increase the performance of sorting)
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  1.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  Annotations on POJOs (by the users)
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>  -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     Concretely:
>>> > > > > > >>>>>>>>>     -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>        annotate fields that will never be nulls -> no
>>> > null
>>> > > > tag
>>> > > > > > >>>>>> needed
>>> > > > > > >>>>>>>>>        before every field!
>>> > > > > > >>>>>>>>>        -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>        make a POJO final -> no subclass tag needed
>>> > > > > > >>>>>>>>>        -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>        annotating a POJO that it will not be null ->
>>> no
>>> > top
>>> > > > > level
>>> > > > > > >>>>>> null
>>> > > > > > >>>>>>>>>        tag needed
>>> > > > > > >>>>>>>>>        -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     These would also help with the getLength problem
>>> > (6.),
>>> > > > > > >>>> because
>>> > > > > > >>>>>> the
>>> > > > > > >>>>>>>>>     length is often not known because currently
>>> anything
>>> > > can
>>> > > > be
>>> > > > > > >>>>>> null
>>> > > > > > >>>>>>> or a
>>> > > > > > >>>>>>>>>     subclass can appear anywhere
>>> > > > > > >>>>>>>>>     -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     These annotations could be done without code
>>> > > generation,
>>> > > > > but
>>> > > > > > >>>>>> then
>>> > > > > > >>>>>>>>>     they would add some overhead when there are no
>>> > > > annotations
>>> > > > > > >>>>>>> present, so this
>>> > > > > > >>>>>>>>>     would work better together with the code
>>> generation
>>> > > > > > >>>>>>>>>     -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     Tuples would become a special case of POJOs,
>>> where
>>> > > > nothing
>>> > > > > > >>>> can
>>> > > > > > >>>>>> be
>>> > > > > > >>>>>>>>>     null, and no subclass can appear, so maybe we
>>> could
>>> > > > > eliminate
>>> > > > > > >>>>>> the
>>> > > > > > >>>>>>>>>     TupleSerializer
>>> > > > > > >>>>>>>>>     -
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>     We could annotate some internal types in Flink
>>> > > libraries
>>> > > > > > >>>> (Gelly
>>> > > > > > >>>>>>>>>     (Vertex, Edge), FlinkML)
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> TODO: what is the situation with Scala case classes?
>>> Run
>>> > > time
>>> > > > > > code
>>> > > > > > >>>>>>>>> generation is probably easier in Scala? (with
>>> > quasiquotes)
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> About me
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> I am in the last year of my Computer Science MSc
>>> studies
>>> > at
>>> > > > > > Eotvos
>>> > > > > > >>>>>>> Lorand
>>> > > > > > >>>>>>>>> University in Budapest, and planning to start a PhD
>>> in
>>> > the
>>> > > > > > autumn.
>>> > > > > > >>>> I
>>> > > > > > >>>>>>> have
>>> > > > > > >>>>>>>>> been working for almost three years at Ericsson on
>>> static
>>> > > > > > analysis
>>> > > > > > >>>>>> tools
>>> > > > > > >>>>>>>>> for C++. In 2014 I participated in GSoC, working on
>>> the
>>> > > LLVM
>>> > > > > > >>>> project,
>>> > > > > > >>>>>>> and I
>>> > > > > > >>>>>>>>> am a frequent contributor ever since. The next
>>> summer I
>>> > was
>>> > > > > > >>>>>> interning at
>>> > > > > > >>>>>>>>> Apple.
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> I learned about the Flink project not too long ago
>>> and I
>>> > > like
>>> > > > > it
>>> > > > > > so
>>> > > > > > >>>>>> far.
>>> > > > > > >>>>>>>>> The last few weeks I was working on some tickets to
>>> > > > familiarize
>>> > > > > > >>>>>> myself
>>> > > > > > >>>>>>> with
>>> > > > > > >>>>>>>>> the codebase:
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3422
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3322
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3457
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> My CV is available here:
>>> > > > > > http://xazax.web.elte.hu/files/resume.pdf
>>> > > > > > >>>>>>>>> References
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> [1]
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> [2]
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> [3]
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> [4]
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> [5]
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> [6]
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861
>>> > > > > > >>>>>>>>> [7] https://issues.apache.org/jira/browse/FLINK-3277
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> Best Regards,
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>> Gábor
>>> > > > > > >>>>>>>>>
>>> > > > > > >>>>>>>>
>>> > > > > > >>>>>>>>
>>> > > > > > >>>>>>>
>>> > > > > > >>>>>>
>>> > > > > > >>>>>
>>> > > > > > >>>>>
>>> > > > > > >>>>
>>> > > > > > >>
>>> > > > > > >>
>>> > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message