beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kenneth Knowles (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-1146) Decrease spark runner startup overhead
Date Mon, 19 Dec 2016 04:15:58 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760088#comment-15760088
] 

Kenneth Knowles commented on BEAM-1146:
---------------------------------------

The Apex runner uses {{@Bind}} annotations to instruct Kryo to use Java serialization on {{Source}}
fields. This annotation is not functional in the version of Kryo linked to the spark runner.
Is it feasible to upgrade and follow this method?

For Coders, the most robust way to use the functionality already present in Beam is to instruct
Kryo to serialize them via the {{asCloudObject}} plus Jackson method and deserialize via Jackson.

Given the somewhat different approaches here, perhaps this ticket should be split? These really
are totally different, even though they seem similar - a {{Source}} is a language-specific
user-definable function, while a {{Coder}} is a proxy for a language-independent binary format.
That's why coders have a defined language-independent serialization (even though that serialization
is just left over from before Beam at the moment).

> Decrease spark runner startup overhead
> --------------------------------------
>
>                 Key: BEAM-1146
>                 URL: https://issues.apache.org/jira/browse/BEAM-1146
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Aviem Zur
>            Assignee: Amit Sela
>
> BEAM-921 introduced a lazy singleton instantiated once in each machine (driver &
executors) which utilizes reflection to find all subclasses of Source and Coder
> While this is beneficial in it's own right, the change added about one minute of overhead
in spark runner startup time (which cause the first job/stage to take up to a minute).
> The change is in class {{BeamSparkRunnerRegistrator}}
> The reason reflection (specifically reflections library) was used here is because  there
is no current way of knowing all the source and coder classes at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message