flink-dev mailing list archives

From Henry Saputra <henry.sapu...@gmail.com>
Subject Re: [DISCUSS] Inconsistent naming of intermediate results
Date Tue, 31 Mar 2015 18:16:24 GMT
As one of the devs who has recently been tracing the runtime portion of
the code, +1 for renaming to bring the names in line with the concepts.

One thing I would like to have is an immediate change to the
documentation [1] together with the renaming PR. Otherwise

Then we need to file a follow-up ticket to update Kostas' awesome wiki page [2].

- Henry

[1] http://ci.apache.org/projects/flink/flink-docs-master/internal_job_scheduling.html
[2] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks

On Tue, Mar 31, 2015 at 7:50 AM, Ufuk Celebi <uce@apache.org> wrote:
> On a high level we call intermediate data produced by programs "intermediate results".
> For example in a WordCount map-reduce program the map function produces an intermediate result,
> which consists of (word, 1) pairs and the reduce function consumes this intermediate result.
> Kostas has recently added documentation explaining the core concepts [1].
>
> The naming of classes related to intermediate results is inconsistent (and probably confusing).
>
> - In JobGraphs (internal low-level API to define programs) they are called IntermediateDataSet
>   and identified by IntermediateDataSetIDs.
>
> - In ExecutionGraphs (JobManager structure used for state tracking/scheduling) they are
>   called IntermediateResult at the ExecutionJobVertex (identified by IntermediateDataSetID)
>   and IntermediateResultPartition at the ExecutionVertex (identified by IntermediateResultPartitionID).
>
> - At runtime (TaskManager) they are called ResultPartition and identified by ResultPartitionID
>   (composition of ExecutionAttemptID and IntermediateResultPartitionID). These are further subpartitioned
>   into ResultSubpartition instances.
>
> I propose to get the naming more in line with the existing naming scheme and prefix it
> with the corresponding management structures:
>
> 1) IntermediateDataSet => JobVertexResult (identified by JobVertexResultID)
> 2) IntermediateResult => ExecutionJobVertexResult (identified by JobVertexResultID)
> 3) IntermediateResultPartition => ExecutionVertexResult (identified by ExecutionVertexResultID)
> 4) ResultPartition => Result
> 5) ResultSubpartition => ResultPartition
>
> These names are non-user facing, but still at the core of the system. I think that consistent
> naming of these classes will make it easier for new contributors to get an overview of how
> single components relate to each other (the prefixes indicate this). In the docs, we can still
> refer to the high-level concept as "intermediate results".
>
> What's your opinion on this? I think now is a good time to think about this stuff, because
> the core classes have only been added recently to the system. Feel free to propose alternatives. :-)
>
> – Ufuk
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
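
For reference, the five renames proposed in the quoted message can be collected into a small lookup table. This is only an illustrative sketch in plain Java: the class ProposedRenames and its method proposedNameFor are hypothetical and not part of Flink. Only the old and new class names are taken from the proposal itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink: it only records the renames proposed in this thread.
final class ProposedRenames {

    // Current class name -> proposed class name, in the order of the proposal.
    static final Map<String, String> RENAMES = new LinkedHashMap<>();

    static {
        RENAMES.put("IntermediateDataSet", "JobVertexResult");
        RENAMES.put("IntermediateResult", "ExecutionJobVertexResult");
        RENAMES.put("IntermediateResultPartition", "ExecutionVertexResult");
        RENAMES.put("ResultPartition", "Result");
        RENAMES.put("ResultSubpartition", "ResultPartition");
    }

    // Returns the proposed name, or the input unchanged if no rename was proposed for it.
    static String proposedNameFor(String currentName) {
        return RENAMES.getOrDefault(currentName, currentName);
    }

    public static void main(String[] args) {
        // Print the table in proposal order.
        RENAMES.forEach((oldName, newName) -> System.out.println(oldName + " => " + newName));
    }
}
```

Note that ResultPartition appears on both sides of the proposal (as the old name in 4 and as the new name in 5), so such a table has to be applied exactly once per name and never chained.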
