beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chinmay Kolhatkar (JIRA)" <>
Subject [jira] [Commented] (BEAM-831) ParDo Chaining
Date Wed, 15 Feb 2017 12:16:41 GMT


Chinmay Kolhatkar commented on BEAM-831:

[~thw], [~jkff], I wet through the links provided and understood producer-consumer fusion
and sibling fusion concepts.

I'm currently focusing on producer-consumer fusion optimization.
To do that here is the approach I'm considering (Still working on a POC yet, so might change):
1. Majority of the changes would go in TranslateContext of apex runner.
2. The streams variable can hold information about Locality as well which will later be used
in TranslationContext.populateDAG method.
3. In populateDAG Api, before the streams are connected, We can traverse the streams/operators
in topological order and find out adjacent ParDo stages for putting them in either thread
local OR container local and update hte field in streams variable with right Locality.

Only thing that I'm not sure about is when to stop the merging of ParDos... i.e. if the DAG
is like ParDo A -> ParDo B -> ParDo C -> ParDo D.
Then at time it might be efficient to merge only B & C and not merge all of them... Any
thoughts on this?

Also, has any runner already done this?

I'm also considering to update ApexFlattenOperator and put the streams in ThreadLocal instead
of Default locality. Might be a different Jira for that.

Please share your opinion.

> ParDo Chaining
> --------------
>                 Key: BEAM-831
>                 URL:
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-apex
>            Reporter: Thomas Weise
> Current state of Apex runner creates a plan that will place each operator in a separate
container (which would be processes when running on a YARN cluster). Often the ParDo operators
can be collocated in same thread or container. Use Apex affinity/stream locality attributes
for more efficient execution plan.  

This message was sent by Atlassian JIRA

View raw message