flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Till Rohrmann <till.rohrm...@gmail.com>
Subject Re: Handling custom types with Kryo
Date Tue, 20 Jan 2015 10:39:48 GMT
I like the idea to automatically figure out which types are used by a
program and to register them at Kryo. Thus, +1 for this idea.

On Tue, Jan 20, 2015 at 11:34 AM, Robert Metzger <rmetzger@apache.org>
wrote:

> @Stephan: Yes, you are summarizing it correctly.
> I'll assign FLINK-1417 to myself and implement it as discussed here (once I
> have resolved the other issues assigned to me)
>
> There is one additional point we forgot in the discussion so far: We are
> initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just
> checked and its registering 51 default serializers and 81 registered types.
>
> I just talked to Till (who implemented the KryoSerializer originally) and
> he also suggests to pool kryo instances using thread-local variables. I'll
> look into that as well once FLINK-1417 has been resolved. I think that
> helps to mitigate the heavy initialization of Kryo (which is inevitable)
>
>
> @Arvid: Our POJO/Types support does explicitly not require our users to
> implement any interfaces, so that option is not feasible.
> The bytecode analysis is not part of the main flink distribution because it
> is using a library with an apache incompatible license.
>
>
>
> On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <arvid.heise@gmail.com>
> wrote:
>
> > An alternative way that may not be applicable in your case:
> >
> > For Sopremo, all types implemented a common interface. When a package is
> > loaded, the Sopremo package manager scans the jar and looks for classes
> > implementing the interfaces (quite fast, because not the entire class
> must
> > be loaded). All types implementing the interface are automatically added
> to
> > Kryo with their respective annotated serializers.
> >
> > If you still have bytecode analysis of jobs, you can also statically
> > determine all types that are used, check for default serializers, and
> > maintain only a minimal set of serializers used for this specific job.
> You
> > could already warn for unregistered types before submitting jobs.
> >
> > On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <sewen@apache.org> wrote:
> >
> > > Yes, I agree that the Avro serializer should be available by default.
> > That
> > > is one case of a typical type that should work out of the box, given
> that
> > > we support Avro file formats.
> > >
> > > Let me summarize how I understood that suggestion:
> > >
> > >  - We make Avro available by default by registering a default
> serializer
> > > for the SpecificBase
> > >
> > >  - We create a library of serializers. We do not register them by
> > default.
> > >
> > >  - Via FLINK-1417, we analyze the types. For any (nested) type that we
> > > encounter for which we have a serializer in the library, we register
> that
> > > serializer as the default serializer. Also, for every (nested) type we
> > > encounter, we register a tag at Kryo.
> > >
> > > I like that, it should give a nice and smooth user experience.
> > >
> > > Greetings,
> > > Stephan
> > >
> > >
> > >
> > >
> > > On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <rmetzger@apache.org>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > thank you for putting our discussion to the mailing list. This is
> > indeed
> > > > where such discussions belong. For the others, we started discussing
> > > here:
> > > > https://github.com/apache/flink/pull/304
> > > >
> > > > I think there is one additional approach, which is probably close to
> > (1):
> > > > We only register those serializers by default which we don't see in
> the
> > > > pre-flight phase (I think right now thats only GenericData.Array from
> > > > Avro).
> > > > We would come across all the other classes (Jodatime, Protobuf, Avro,
> > > > Thrift, ...) when traversing the class hierarchy, as proposed in
> > > > FLINK-1417. With this approach, users get the best out-of-the box
> > > > experience and the number of registered classes / serializers is kept
> > at
> > > a
> > > > minimum.
> > > > We can still offer means to register additional serializers (I think
> > > thats
> > > > already merged to master).
> > > >
> > > > My main concern with this particular issue is a good out of the box
> > user
> > > > experience. If there is an issue with type serialization, users will
> > > notice
> > > > it very early. (In my experience people often have their existing
> > > datatypes
> > > > they use with other systems, and they want to continue using them)
> > > > Therefore, I want to put some effort into making it as good as
> > possible.
> > > I
> > > > would actually sacrifice performance over stability/usability here.
> Our
> > > > system is flexible enough to replace it later with a more efficient
> > > > serialization if that becomes an issue. But maybe my suggestion above
> > is
> > > > already sufficient.
> > > >
> > > > We could also think about introducing a configuration variable which
> > > allows
> > > > users to disable the default serializers.
> > > >
> > > >
> > > > Regarding the second question: Is there a downside registering all
> > types
> > > > for tagging? We reduce the overall I/O which is good for performance.
> > > >
> > > > Best,
> > > > Robert
> > > >
> > > >
> > > >
> > > > On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <sewen@apache.org>
> > wrote:
> > > >
> > > > > Hi all!
> > > > >
> > > > > We have various pending pull requests that add support for certain
> > > types
> > > > by
> > > > > adding extra kryo serializers.
> > > > >
> > > > > I think we need to decide how we want to handle the support for
> extra
> > > > > types, because more are certainly to come.
> > > > >
> > > > > As I understand it, we have three broad options:
> > > > >
> > > > > (1)
> > > > > Add as many serializers to Kryo by default as possible.
> > > > >  Pro:
> > > > >     - Many types work out of the box
> > > > >  Contra:
> > > > >     - We may eventually overload the kryo registry with serializers
> > > > >       that are not needed for most cases and suffer in performance
> > > > >     - It is hard to guess which types work out of the box
> > > (intransparent)
> > > > >
> > > > >
> > > > > (2)
> > > > > We create a collection of serializers and a registration util.
> > > > > --------
> > > > > val env = ExecutionEnvironemnt.getExecutionEnviroment()
> > > > >
> > > > > Serializers.registerProtoBufSerializers(env);
> > > > > Serializers.registerJavaUtilSerializers(env);
> > > > > ---------
> > > > > Pro:
> > > > >   - Easy for users
> > > > >   - We can grow the set of supported types very large without
> > > overloading
> > > > > Kryo
> > > > >   - It is transparent what gets registered
> > > > >
> > > > > Contra:
> > > > >   - Not quite as convenient as if things just run
> > > > >
> > > > >
> > > > > (3)
> > > > > We do nothing and let the user create and register whatever is
> > needed.
> > > > >
> > > > > We could have a library and utility for serializers for certain
> > > > libraries.
> > > > > Users could use this to conveniently add serializers for the
> > libraries
> > > > they
> > > > > use.
> > > > > Pro:
> > > > >   - Simple for us ;-)
> > > > > Contra:
> > > > >   - More repeated work for users
> > > > >
> > > > >
> > > > > ========================
> > > > >
> > > > > For approach (1) and (2), there is an orthogonal question of
> whether
> > we
> > > > > want to simply register default serializers (that enable that types
> > > work)
> > > > > or also register types for tags, to speed up the serialization of
> > those
> > > > > types.
> > > > >
> > > > >
> > > > > Greetings,
> > > > > Stephan
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message