beam-commits mailing list archives

From "Amit Sela (JIRA)" <>
Subject [jira] [Commented] (BEAM-1146) Decrease spark runner startup overhead
Date Mon, 19 Dec 2016 08:25:58 GMT


Amit Sela commented on BEAM-1146:

As for {{Spark}} and {{Kryo}} - we can't change Spark dependencies easily since Kryo is used
via Twitter's {{Chill}} (Kryo in Scala).

I think this is a larger question about serialization in Beam: 

# User Data - Using Coders (explicit) avoids serialization issues that might occur due to
different types of user data (e.g., for runners using Kryo and a pipeline processing data
that is not Kryo serializable) and provides better performance than (implicit) reflection-based
serialization frameworks (knowing what you serialize ahead of time is always more efficient, right?).
# The "Closure" - This is everything that is sent to the worker to process "bundles",
such as Coders, Sources, etc. I'm suggesting we "tag" them in Beam so it's clear they are expected
to be "shipped" as part of the processing task.
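
To make the two points above concrete, here is a minimal sketch, not Beam code. {{ShippedToWorker}}, {{ToyStringCoder}} and {{isClosure}} are hypothetical names invented for illustration: the marker interface stands in for the proposed "tag", and the toy coder shows explicit encoding of user data while itself being part of the closure.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.Serializable;
import java.io.UncheckedIOException;

class ClosureTagSketch {
    // Hypothetical marker interface (not part of Beam): anything tagged with it
    // belongs to the "closure" and is expected to be shipped to workers.
    interface ShippedToWorker extends Serializable {}

    // Toy explicit coder for user data. The coder itself is part of the closure,
    // so it carries the tag; the Strings it encodes are user data and do not.
    static final class ToyStringCoder implements ShippedToWorker {
        private static final long serialVersionUID = 1L;

        byte[] encode(String value) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(value);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return bytes.toByteArray();
        }

        String decode(byte[] encoded) {
            try (DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(encoded))) {
                return in.readUTF();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // A runner could branch on the tag to decide how to serialize an object.
    static boolean isClosure(Object o) {
        return o instanceof ShippedToWorker;
    }

    public static void main(String[] args) {
        ToyStringCoder coder = new ToyStringCoder();
        String roundTripped = coder.decode(coder.encode("hello"));
        // prints "hello true false"
        System.out.println(roundTripped + " " + isClosure(coder) + " " + isClosure("hello"));
    }
}
```

With a tag like this, a runner would not need to guess which classes are shipped with the task; whether a marker interface, an annotation, or something else is the right mechanism is left open here.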

It seems that while user data is handled properly in Beam, the rest is still somewhat undecided,
which might be fine since it integrates with runners (not with pipelines, as user data does), but I
wonder if we can do more here.


[~aviemzur] suggested that the Spark runner wrap Sources and Coders so it'll always know
to serialize them with the {{JavaSerializer}}, which is a great solution for the Spark runner.
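
A rough sketch of the wrapping idea, using only the JDK so that it runs on its own. {{JavaSerialized}}, {{ToyCoder}} and {{javaRoundTrip}} are hypothetical names, not existing Beam or Spark classes. In an actual runner, the assumption is that only the wrapper class would be registered with Kryo, along the lines of {{kryo.register(JavaSerialized.class, new JavaSerializer())}}, so that whatever it holds goes through Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class JavaSerializedWrapperSketch {
    // Hypothetical wrapper: a single, known class that holds any Serializable
    // Source or Coder, so the runner only has to special-case this one type.
    static final class JavaSerialized<T extends Serializable> implements Serializable {
        private static final long serialVersionUID = 1L;
        private final T value;

        JavaSerialized(T value) {
            this.value = value;
        }

        T get() {
            return value;
        }
    }

    // Stand-in for a Coder or Source; the real ones are Serializable as well.
    static final class ToyCoder implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;

        ToyCoder(String name) {
            this.name = name;
        }
    }

    // Round-trips an object through plain Java serialization, which is what
    // shipping the wrapper to a worker would amount to under this assumption.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T javaRoundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Java serialization failed", e);
        }
    }

    public static void main(String[] args) {
        JavaSerialized<ToyCoder> wrapped =
                new JavaSerialized<>(new ToyCoder("toy-string-coder"));
        JavaSerialized<ToyCoder> copy = javaRoundTrip(wrapped);
        System.out.println(copy.get().name); // prints "toy-string-coder"
    }
}
```

The appeal of this shape is that the set of classes needing special treatment shrinks to one known wrapper, so no classpath scan is required to discover Source and Coder subclasses.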



> Decrease spark runner startup overhead
> --------------------------------------
>                 Key: BEAM-1146
>                 URL:
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Aviem Zur
>            Assignee: Amit Sela
> BEAM-921 introduced a lazy singleton, instantiated once on each machine (driver &
> executors), which uses reflection to find all subclasses of Source and Coder.
> While this is beneficial in its own right, the change added about one minute of overhead
> to Spark runner startup time (which causes the first job/stage to take up to a minute).
> The change is in class {{BeamSparkRunnerRegistrator}}.
> The reason reflection (specifically the reflections library) was used here is that there
> is currently no way of knowing all the Source and Coder classes at runtime.
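
For readers unfamiliar with the pattern in the issue description, the following is a minimal sketch of a lazy singleton whose initialization is paid once per JVM, on first use. It is not the {{BeamSparkRunnerRegistrator}} code: {{scan()}} is a hypothetical stand-in for the reflection-based classpath scan, and the returned class names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class LazyRegistrySketch {
    // Counts how many times the (simulated) scan runs.
    static final AtomicInteger SCANS = new AtomicInteger();

    // Hypothetical stand-in for the expensive reflection-based scan that
    // finds all Source and Coder subclasses.
    static List<String> scan() {
        SCANS.incrementAndGet();
        return Arrays.asList("ToySource", "ToyCoder");
    }

    // Initialization-on-demand holder: the JVM initializes Holder, and thus
    // runs scan(), only when classes() is first called, and only once.
    private static final class Holder {
        static final List<String> CLASSES = scan();
    }

    static List<String> classes() {
        return Holder.CLASSES;
    }

    public static void main(String[] args) {
        classes();
        classes();
        System.out.println("scans: " + SCANS.get()); // prints "scans: 1"
    }
}
```

The scan cost is not avoided, only deferred to the first caller, which is consistent with the first job/stage absorbing the delay on the driver and on each executor.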
