beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chinmay Kolhatkar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-831) ParDo Chaining
Date Wed, 15 Feb 2017 12:17:41 GMT

    [ https://issues.apache.org/jira/browse/BEAM-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867722#comment-15867722
] 

Chinmay Kolhatkar edited comment on BEAM-831 at 2/15/17 12:17 PM:
------------------------------------------------------------------

[~thw], [~jkff], I went through the links provided and understood producer-consumer fusion
and sibling fusion concepts.

I'm currently focusing on producer-consumer fusion optimization. I'm unsure how much good
it is to do sibling fusion for Apex Runner.
To do that here is the approach I'm considering (Still working on a POC yet, so might change):
1. Majority of the changes would go in TranslateContext of apex runner.
2. The streams variable can hold information about Locality as well which will later be used
in TranslationContext.populateDAG method.
3. In populateDAG Api, before the streams are connected, We can traverse the streams/operators
in topological order and find out adjacent ParDo stages for putting them in either thread
local OR container local and update hte field in streams variable with right Locality.

Only thing that I'm not sure about is when to stop the merging of ParDos... i.e. if the DAG
is like ParDo A -> ParDo B -> ParDo C -> ParDo D.
Then at time it might be efficient to merge only B & C and not merge all of them... Any
thoughts on this?

Also, has any runner already done this?

I'm also considering to update ApexFlattenOperator and put the streams in ThreadLocal instead
of Default locality. Might be a different Jira for that.

Please share your opinion.


was (Author: chinmay):
[~thw], [~jkff], I wet through the links provided and understood producer-consumer fusion
and sibling fusion concepts.

I'm currently focusing on producer-consumer fusion optimization.
To do that here is the approach I'm considering (Still working on a POC yet, so might change):
1. Majority of the changes would go in TranslateContext of apex runner.
2. The streams variable can hold information about Locality as well which will later be used
in TranslationContext.populateDAG method.
3. In populateDAG Api, before the streams are connected, We can traverse the streams/operators
in topological order and find out adjacent ParDo stages for putting them in either thread
local OR container local and update hte field in streams variable with right Locality.

Only thing that I'm not sure about is when to stop the merging of ParDos... i.e. if the DAG
is like ParDo A -> ParDo B -> ParDo C -> ParDo D.
Then at time it might be efficient to merge only B & C and not merge all of them... Any
thoughts on this?

Also, has any runner already done this?

I'm also considering to update ApexFlattenOperator and put the streams in ThreadLocal instead
of Default locality. Might be a different Jira for that.

Please share your opinion.

> ParDo Chaining
> --------------
>
>                 Key: BEAM-831
>                 URL: https://issues.apache.org/jira/browse/BEAM-831
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-apex
>            Reporter: Thomas Weise
>
> Current state of Apex runner creates a plan that will place each operator in a separate
container (which would be processes when running on a YARN cluster). Often the ParDo operators
can be collocated in same thread or container. Use Apex affinity/stream locality attributes
for more efficient execution plan.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message