apex-dev mailing list archives

From Pramod Immaneni <pra...@datatorrent.com>
Subject Re: [Discuss] Design of the python execution operator
Date Thu, 14 Dec 2017 17:21:09 GMT
Hi Anath,

Sounds interesting, and it looks like you have put quite a bit of work into it.
Might I suggest changing the title of 2260 to better fit your proposal and
implementation, mainly so that it is differentiated from 2261?

I wanted to discuss the proposal to use multiple threads in an operator
instance. Unless the execution threads are blocking on some sort of I/O,
why would this result in a noticeable performance difference compared to
processing in the operator thread and running multiple partitions of the
operator container-local? Running the processing in a thread separate
from the operator lifecycle thread still does not free you from having to
match the incoming data throughput. The checkpoint is the point at which
backpressure will start to materialize, because the operator has to wait
for the background processing to complete in order to guarantee that all
data up to the checkpoint has been processed.
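
To make the checkpoint argument concrete, here is a minimal toy sketch in
Python. It is not Apex code: `ToyAsyncOperator`, `process`,
`before_checkpoint` and `slow_square` are hypothetical names chosen for
illustration, and the 10 ms sleep merely stands in for CPU-bound work.
The point it tries to show is that handing tuples to a background pool only
defers the cost: the operator thread still has to block at the checkpoint
until the backlog has drained.

```python
from concurrent.futures import ThreadPoolExecutor, wait
import time


class ToyAsyncOperator:
    """Hypothetical stand-in for an operator that offloads tuples to a worker pool."""

    def __init__(self, workers, work_fn):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.work_fn = work_fn
        self.pending = []   # futures submitted since the last checkpoint
        self.results = []   # results that are safely "processed"

    def process(self, tup):
        # The operator thread returns immediately; the work continues in the background.
        self.pending.append(self.pool.submit(self.work_fn, tup))

    def before_checkpoint(self):
        # To guarantee that all data up to the checkpoint is processed,
        # the operator thread has to block here until the backlog is empty.
        started = time.monotonic()
        wait(self.pending)
        self.results.extend(f.result() for f in self.pending)
        self.pending.clear()
        return time.monotonic() - started

    def close(self):
        self.pool.shutdown(wait=True)


def slow_square(x):
    time.sleep(0.01)  # assumption: stands in for per-tuple processing cost
    return x * x


op = ToyAsyncOperator(workers=2, work_fn=slow_square)
for t in range(8):
    op.process(t)                      # accepted faster than it can be processed
backlog_before = len(op.pending)       # everything accepted so far is still outstanding
blocked_for = op.before_checkpoint()   # the wait shows up here instead
op.close()
print(f"backlog={backlog_before}, blocked at checkpoint for {blocked_for:.3f}s")
```

In this toy model the time spent blocked in `before_checkpoint` grows with
the backlog, which is the backpressure described above.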

Thanks


On Thu, Dec 14, 2017 at 2:20 AM, Ananth G <ananthg.apex@gmail.com> wrote:

> Hello All,
>
> I would like to submit the design for the Python execution operator before
> I raise the pull request so that I can refine the implementation based on
> feedback. Could you please provide feedback on the design if any and I will
> raise the PR accordingly.
>
> - This operator is for the JIRA ticket raised here:
> https://issues.apache.org/jira/browse/APEXMALHAR-2260
> - The operator embeds a python interpreter in the operator JVM process
> space and is not external to the JVM.
> - The implementation proposes the use of Java Embedded Python (JEP),
> available at https://github.com/ninia/jep
> - The JEP engine is under the zlib/libpng license. Since this is an approved
> license under https://www.apache.org/legal/resolved.html#category-a, I am
> assuming it is OK for the community to approve the inclusion of this library.
> - Python integration is a messy piece due to the nature of dynamic
> libraries. All Python libraries need to be natively installed. This also
> means we will not be able to bundle Python libraries and dependencies as part
> of the build into the target JVM container. Hence this operator currently
> requires the Python binaries to be installed through an external process on
> all of the YARN nodes.
> - The JEP maven dependency jar in the POM is a JNI wrapper around the
> dynamic library that is installed externally to the Apex installation
> process on all of the YARN nodes.
> - Hope to take up https://issues.apache.org/jira/browse/APEXCORE-796 to
> solve this issue in the future.
> - The Python operator implementation can be extended to a Py4J-based
> implementation (as opposed to an in-memory model like JEP) in the future if
> required. JEP is the implementation based on an in-memory design pattern.
> - The Python operator allows for 4 major API patterns:
>     - Execute a method call, accepting parameters to pass to the
> interpreter
>     - Execute a Python script given as a file path
>     - Evaluate an expression, with variables passed between the Java code
> and the in-memory Python interpreter bridge
>     - A convenience method wherein a series of instructions can be passed in
> one single Java call (executed as a sequence of Python eval instructions
> under the hood)
> - Automatic garbage collection of the variables that are passed from java
> code to the in memory python interpreter
> - Support for all major Python libraries: Tensorflow, Keras, Scikit,
> xgboost. Preliminary tests for these libraries seem to work as per the code
> here:
> https://github.com/ananthc/sampleapps/tree/master/apache-apex/apexjvmpython
> - The implementation allows for an SLA-based execution model, i.e. the
> operator is given a chance to execute the Python code and, if it does not
> complete within a timeout, the operator code returns null.
> - A tuple that has become a straggler as per the previous point will
> automatically be drained off to a different port so that downstream
> operators can still consume the straggler, if they want to, when the results
> arrive.
> - Because Python is interpreted, if a previous tuple is still being
> processed there is a chance of a back-pressure pattern building up very
> quickly. Hence this operator works on the concept of a worker pool. The
> Python operator uses a configurable number of worker threads, each of which
> embeds the Python interpreter within its processing space, i.e. it is in
> fact a collection of in-memory Python interpreters inside the Python
> operator implementation.
> - The operator chooses one of the threads at runtime based on their busy
> state, thus allowing back-pressure issues to be resolved automatically.
> - There is first-class support for Numpy in JEP. Java arrays are
> convertible to Python Numpy arrays and vice versa, and share the same
> memory addresses for efficiency reasons.
> - The base operator implements dynamic partitioning based on a thread
> starvation policy. At each checkpoint, it checks what percentage of the
> requests resulted in starved threads and, if the starvation exceeds a
> configured percentage, a new instance of the operator is provisioned for
> every such instance of the operator.
> - The operator provides the notion of a worker execution mode. There are
> two worker modes that are passed in each of the above calls from the user:
> ALL or ANY. Because the Python interpreter is a state-based engine, a newly
> dynamically partitioned operator might not be in the exact state of the
> remaining operators. Hence the operator has this notion of worker execution
> mode. Any call (any of the 4 calls mentioned above) made with the ALL
> execution mode will be executed on all the workers of the worker thread
> pool as well as on the dynamically partitioned instance whenever such an
> instance is provisioned.
> - The base operator implementation has a method that can be overridden to
> implement the logic that needs to be executed for each tuple. The base
> operator default implementation is a simple NO-OP.
> - The operator automatically picks up the least busy of the thread pool
> worker which has JEP embedded in it to execute the call.
> - The JEP-based installation will not support non-CPython modules. All of
> the major Python libraries are CPython based and hence I believe this is of
> lesser concern. If we hit a roadblock when a new Python library that is not
> CPython based needs to be run, then we could implement the
> ApexPythonEngine interface with something like Py4J, which involves
> interprocess communication.
> - The python operator requires the user to set the library path
> java.library.path for the operator to make use of the dynamic libraries of
> the corresponding platform. This has to be passed in as the JVM options.
> Failing to do so will result in the operator failing to load the
> interpreter properly.
> - The supported Python versions are 2.7, 3.3, 3.4, 3.5 and 3.6. Numpy >=
> 1.7 is supported.
> - There is no support for virtual environments yet. In case of multiple
> python versions on the node, to include the right python version for the
> apex operator, ensure that the environment variables and the dynamic
> library path are set appropriately. This is a workaround and I hope
> APEXCORE-796 will solve this issue as well.
>
>
> Regards,
> Ananth
>
>
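
The SLA-based execution model and the straggler port described in the quoted
proposal can be sketched roughly as follows. This is a toy model in Python
under stated assumptions, not the proposed operator: `ToySlaExecutor`,
`execute`, `stragglers` and `score` are hypothetical names, a plain list
stands in for the straggler output port, and the timings are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import threading
import time


class ToySlaExecutor:
    """Hypothetical model: return None on SLA breach, emit late results elsewhere."""

    def __init__(self, workers, sla_seconds):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.sla_seconds = sla_seconds
        self.stragglers = []            # stands in for the straggler output port
        self._lock = threading.Lock()

    def execute(self, fn, tup):
        future = self.pool.submit(fn, tup)
        try:
            # Give the worker a bounded amount of time to finish.
            return future.result(timeout=self.sla_seconds)
        except TimeoutError:
            # SLA missed: the caller gets None now, and the result is
            # drained to the straggler list whenever it eventually arrives.
            future.add_done_callback(lambda f: self._emit_straggler(tup, f))
            return None

    def _emit_straggler(self, tup, future):
        with self._lock:
            self.stragglers.append((tup, future.result()))

    def close(self):
        # Waits for outstanding work, so late results are emitted before returning.
        self.pool.shutdown(wait=True)


def score(x):
    time.sleep(0.6 if x == "slow" else 0.0)  # assumption: "slow" breaches the SLA
    return x.upper()


ex = ToySlaExecutor(workers=2, sla_seconds=0.2)
fast = ex.execute(score, "fast")   # completes within the SLA
slow = ex.execute(score, "slow")   # times out; the result arrives later as a straggler
ex.close()
print(fast, slow, ex.stragglers)
```

Note that in this sketch the timed-out task keeps occupying a worker until it
finishes, so a pool of stragglers can still starve new tuples.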
