flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Re: Rethink the "always copy" policy for streaming topologies
Date Fri, 02 Oct 2015 16:30:14 GMT

I think you were a user of the Batch API before we made the non-reuse mode
the default mode.
By now, when you use a GroupReduceFunction or a MapPartitionFunction or so,
you need not do any cloning or copying. All functions that receive groups
will always get fresh elements.

This chaining issue concerns only MapFunction (and FlatMapFunction) where
users keep lists to remember elements across invokations to the MapFunction.

On Fri, Oct 2, 2015 at 6:27 PM, Martin Neumann <mneumann@sics.se> wrote:

> It seems like I'm one of the few people that run into the mutable elements
> trap on the Batch API from time to time. At the moment I always clone when
> I'm not 100% sure to avoid hunting the bugs later. So far I was happy to
> learn that this is not a problem in Streaming, but that's just me.
> When working with groupby and partition functions, its easy to forget that
> there is one class per operator not per partition. So if you write your
> code in the state of mind that each partition is separate object reduce on
> operator level becomes really annoying.
> Since the mapping between partitions and operators is usually hidden, makes
> the debugging harder especially in cases where the test data produces a
> single partition per operator and the real deployment does not.
> *To summarize:*
> I'm not against reusing objects as long as there is something that helps
> ease the pitfalls. This could be coding guidelines, debugging tools or best
> practices.
> On Fri, Oct 2, 2015 at 5:53 PM, Stephan Ewen <sewen@apache.org> wrote:
> > Hi all!
> >
> > Now that we are coming to the next release, I wanted to make sure we
> > finalize the decision on that point, because it would be nice to not
> break
> > the behavior of system afterwards.
> >
> > Right now, when tasks are chained together, the system copies the
> elements
> > always between different tasks in the same chain.
> >
> > I think this policy was established under the assumption that copies do
> not
> > cost anything, given our own test examples, which mainly use immutable
> > types like Strings, boxed primitives, ..
> >
> > In practice, a lot of data types are actually quite expensive to copy.
> >
> > For example, a rather common data type in the event analysis of
> web-sources
> > is JSON Object.
> > Flink treats this as a generic type. Depending on its concrete
> > implementation, Kryo may have perform a serialization copy, which means
> > encoding into bytes (JSON encoding, charset encoding) and decoding again.
> >
> > This has a massive impact on the out-of-the-box performance of the
> system.
> > Given that, I was wondering whether we should set to default policy to
> "not
> > copying".
> >
> > That is basically the behavior of the batch API, and there has so far
> never
> > been an issue with that (people running into the trap of overwritten
> > mutable elements).
> >
> > What do you think?
> >
> > Stephan
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message