flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Greg Hogan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-3333) Documentation about object reuse should be improved
Date Tue, 09 Feb 2016 16:28:18 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139151#comment-15139151

Greg Hogan commented on FLINK-3333:

Apache Flink programs can be written and configured to reduce the number of object allocations
for better performance. User defined functions (like map() or groupReduce()) process many
millions or billions of input and output values. Enabling object reuse and processing mutable
objects improves performance by lowering demand on the CPU cache and Java garbage collector.

Object reuse is disabled by default, with user defined functions generally getting new objects
on each call (or through an iterator). In this case it is safe to store references to the
objects inside the function (for example, in a List).

<'storing values in a list' example>

Apache Flink will chain functions to improve performance when sorting is preserved and the
parallelism unchanged. The chainable operators are Map, FlatMap, Reduce on full DataSet, GroupCombine
on a Grouped DataSet, or a GroupReduce where the user supplied a RichGroupReduceFunction with
a combine method). Objects are passed without copying _even when object reuse is disabled_.

In the chaining case, the functions in the chain are receiving the same object instances.
So the the second map() function is receiving the objects the first map() is returning. This
behavior can lead to errors when the first map() function keeps a list of all objects and
the second mapper is modifying objects. In that case, the user has to manually create copies
of the objects before putting them into the list.

<chainable example>

<discussion of copyable values>

<copyablevalue example>

There is a switch at the ExecutionConfig which allows users to enable the object reuse mode
(enableObjectReuse()). For mutable types, Flink will reuse object instances. In practice that
means that a user function will always receive the same object instance (with its fields set
to new values). The object reuse mode will lead to better performance because fewer objects
are created, but the user has to manually take care of what they are doing with the object

<object reuse example>

> Documentation about object reuse should be improved
> ---------------------------------------------------
>                 Key: FLINK-3333
>                 URL: https://issues.apache.org/jira/browse/FLINK-3333
>             Project: Flink
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 1.0.0
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
>            Priority: Blocker
>             Fix For: 1.0.0
> The documentation about object reuse \[1\] has several problems, see \[2\].
> \[1\] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#object-reuse-behavior
> \[2\] https://docs.google.com/document/d/1cgkuttvmj4jUonG7E2RdFVjKlfQDm_hE6gvFcgAfzXg/edit

This message was sent by Atlassian JIRA

View raw message