apex-dev mailing list archives

From Ananth G <ananthg.a...@gmail.com>
Subject [Discuss] Design of the python execution operator
Date Thu, 14 Dec 2017 10:20:46 GMT
Hello All,

I would like to submit the design for the Python execution operator before I raise the pull
request, so that I can refine the implementation based on feedback. Please share any feedback
on the design, and I will raise the PR accordingly.

- This operator is for the JIRA ticket raised here https://issues.apache.org/jira/browse/APEXMALHAR-2260
- The operator embeds a python interpreter in the operator JVM process space and is not external
to the JVM.
- The implementation proposes using Java Embedded Python (JEP): https://github.com/ninia/jep
- The JEP engine is under the zlib/libpng license. Since this is an approved license under
https://www.apache.org/legal/resolved.html#category-a, I am assuming it is OK for the
community to approve the inclusion of this library.
- Python integration is messy because of the nature of dynamic libraries. All Python libraries
need to be natively installed. This also means we will not be able to bundle Python libraries
and dependencies as part of the build into the target JVM container. Hence, for now, the
operator requires the Python binaries to be installed by an external process on all
of the YARN nodes.
- The JEP maven dependency jar in the POM is a JNI wrapper around the dynamic library that
is installed externally to the Apex installation process on all of the YARN nodes.
- I hope to take up https://issues.apache.org/jira/browse/APEXCORE-796
to solve this issue in the future.
- JEP is an implementation based on an in-memory design pattern. The Python operator
implementation can be extended to a Py4J-based implementation (as opposed to an in-memory
model like JEP) in the future if required.
- The python operator allows for 4 major API patterns
    - Execute a method call by accepting parameters to pass to the interpreter
    - Execute a python script as given in a file path
    - Evaluate an expression, allowing variables to be passed between the Java code and
the Python in-memory interpreter bridge
    - A handy method wherein a series of instructions can be passed in one single java call
( executed as a sequence of python eval instructions under the hood ) 
- Automatic garbage collection of the variables that are passed from java code to the in memory
python interpreter
- Support for all major Python libraries: TensorFlow, Keras, Scikit, XGBoost. Preliminary
tests for these libraries seem to work, as per the code here: https://github.com/ananthc/sampleapps/tree/master/apache-apex/apexjvmpython
- The implementation allows for an SLA-based execution model, i.e. the operator is given a chance
to execute the Python code, and if the call does not complete within a timeout, the operator
code returns null.
- A tuple that has become a straggler as per the previous point is automatically drained
off to a different port, so that downstream operators can still consume the straggler result
if they want to when it arrives.
- Because Python is interpreted, back pressure can build up very quickly if a previous tuple
is still being processed. Hence this operator works on the concept of a worker pool. The
Python operator uses a configurable number of worker threads, each of which embeds a Python
interpreter within its processing space, i.e. the operator is in fact a collection of
in-memory Python interpreters.
- The operator chooses one of the threads at runtime based on their busy state, thus allowing
back-pressure issues to be resolved automatically.
- There is first-class support for NumPy in JEP. Java arrays are convertible to
Python NumPy arrays and vice versa, and share the same memory addresses for efficiency.
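
The worker pool, SLA timeout and straggler port described above can be sketched roughly as
follows. This is only an illustration in plain Python, not the operator's Java code: the names
Worker, WorkerPool, straggler_port and slow_square are hypothetical, and "least busy" is
assumed here to mean the worker with the fewest pending calls.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class Worker:
    """One thread that would own one embedded interpreter (hypothetical model)."""

    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.pending = 0
        self.lock = threading.Lock()

    def submit(self, fn, *args):
        with self.lock:
            self.pending += 1
        future = self.executor.submit(fn, *args)
        future.add_done_callback(self._done)
        return future

    def _done(self, _future):
        with self.lock:
            self.pending -= 1


class WorkerPool:
    def __init__(self, size, timeout_seconds):
        self.workers = [Worker() for _ in range(size)]
        self.timeout_seconds = timeout_seconds
        self.straggler_port = []  # stands in for the separate straggler output port

    def process(self, fn, *args):
        # Assumption: pick the worker with the fewest pending calls.
        worker = min(self.workers, key=lambda w: w.pending)
        future = worker.submit(fn, *args)
        try:
            return future.result(timeout=self.timeout_seconds)
        except FutureTimeout:
            # SLA missed: return None now, emit the late result on the straggler port.
            future.add_done_callback(
                lambda f: self.straggler_port.append(f.result()))
            return None

    def shutdown(self):
        for worker in self.workers:
            worker.executor.shutdown(wait=True)


def slow_square(x, delay):
    time.sleep(delay)
    return x * x


pool = WorkerPool(size=2, timeout_seconds=0.3)
fast = pool.process(slow_square, 3, 0.0)   # completes within the SLA
late = pool.process(slow_square, 4, 0.8)   # misses the SLA, becomes a straggler
pool.shutdown()                            # waits for the straggler to finish
stragglers = list(pool.straggler_port)
```

In this sketch the caller never blocks longer than the timeout, while the late result is
still delivered once the worker finishes.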

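To make the 4 API patterns listed earlier concrete, here is a toy model of what each call
does on the Python side. It is a sketch under assumptions: ToyInterpreter and its method
names are hypothetical, it uses plain exec/eval rather than JEP, and it does not show the
operator's actual Java method signatures.

```python
import os
import tempfile


class ToyInterpreter:
    """Hypothetical stand-in for one embedded interpreter; not the operator's API."""

    def __init__(self):
        self.namespace = {}  # state persists across calls, as in a live interpreter

    def run_commands(self, commands):
        # Pattern 4: a series of statements executed in order in one call.
        for command in commands:
            exec(command, self.namespace)

    def run_script(self, path):
        # Pattern 2: execute a script file inside the same namespace.
        with open(path) as handle:
            exec(compile(handle.read(), path, "exec"), self.namespace)

    def invoke(self, method_name, *args):
        # Pattern 1: call a previously defined function with parameters.
        return self.namespace[method_name](*args)

    def eval_expr(self, expression, **variables):
        # Pattern 3: pass variables in, evaluate, hand the result back.
        self.namespace.update(variables)
        return eval(expression, self.namespace)


interp = ToyInterpreter()
interp.run_commands(["import math", "def area(r):\n    return math.pi * r * r"])
area = interp.invoke("area", 2.0)
total = interp.eval_expr("x + y", x=3, y=4)

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as script:
    script.write("greeting = 'hello from script'\n")
interp.run_script(script.name)
os.remove(script.name)
greeting = interp.eval_expr("greeting")
```
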
- The base operator implements dynamic partitioning based on a thread starvation policy. At
each checkpoint, it checks what percentage of the requests resulted in starved threads,
and if the starvation exceeds a configured percentage, a new instance of the operator is
provisioned for every such instance of the operator.
- The operator provides the notion of a worker execution mode. One of two worker modes, ALL
or ANY, is passed by the user in each of the above calls. Because the Python interpreter
is a stateful engine, a newly provisioned dynamic partition might not be in the exact
state of the existing operator instances. Hence the operator has this notion of worker
execution mode. Any call (any of the 4 calls mentioned above) made with the ALL execution
mode is executed on all the workers of the worker thread pool, as well as on a dynamically
partitioned instance whenever such an instance is provisioned.
- The base operator implementation has a method that can be overridden to implement the logic
that needs to be executed for each tuple. The base operator default implementation is a simple
- The operator automatically picks the least busy worker in the thread pool, each of which
has JEP embedded in it, to execute the call.
- The JEP-based installation will not support non-CPython modules. All of the major Python
libraries are CPython based, and hence I believe this is a lesser concern. If we hit a
roadblock because a new, non-CPython Python library needs to be run, then we could
implement the ApexPythonEngine interface with something like Py4J, which involves interprocess
- The Python operator requires the user to set the library path java.library.path so that the
operator can use the dynamic libraries of the corresponding platform. This has to be
passed in as a JVM option. Failing to do so will result in the operator failing to load
the interpreter properly.
- The supported Python versions are 2.7, 3.3, 3.4, 3.5 and 3.6. NumPy >= 1.7 is supported.
- There is no support for virtual environments yet. If there are multiple Python versions on
the node, ensure that the environment variables and the dynamic library path are set so that
the Apex operator picks up the right Python version. This is a workaround, and I hope
APEXCORE-796 will solve this issue as well.
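
The thread starvation policy used for dynamic partitioning can be illustrated with a small
calculation. This is a sketch, not the operator's code: CheckpointStats, starved_percent and
plan_partitions are hypothetical names, and a "starved" request is assumed to be one that
found every worker thread busy.

```python
from dataclasses import dataclass


@dataclass
class CheckpointStats:
    """Hypothetical per-instance counters collected between checkpoints."""
    total_requests: int
    starved_requests: int  # assumed: requests that found every worker busy


def starved_percent(stats):
    if stats.total_requests == 0:
        return 0.0
    return 100.0 * stats.starved_requests / stats.total_requests


def plan_partitions(per_instance_stats, threshold_percent):
    """Return the new instance count: one extra instance per starved instance."""
    extra = sum(
        1 for stats in per_instance_stats
        if starved_percent(stats) > threshold_percent
    )
    return len(per_instance_stats) + extra


stats = [CheckpointStats(1000, 350), CheckpointStats(1000, 40), CheckpointStats(0, 0)]
new_count = plan_partitions(stats, threshold_percent=30.0)
```

With a 30 percent threshold, only the first instance (35 percent starved) triggers a new
partition, so three instances become four.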

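The difference between the ALL and ANY worker execution modes can be modelled as below. This
is a toy sketch: ToyOperator and WorkerMode are hypothetical names, each dict stands in for one
interpreter's private state, replaying a recorded history of ALL calls is only one assumed way
to bring a new partition up to date, and ANY uses round-robin here purely to keep the example
deterministic.

```python
from enum import Enum


class WorkerMode(Enum):
    ALL = "ALL"
    ANY = "ANY"


class ToyOperator:
    """Hypothetical model: each dict is one interpreter's private state."""

    def __init__(self, workers, replay=()):
        self.interpreters = [{} for _ in range(workers)]
        self.all_history = []  # ALL-mode calls, kept so new partitions can catch up
        self.next_worker = 0
        for command in replay:
            self.run(command, WorkerMode.ALL)

    def run(self, command, mode):
        if mode is WorkerMode.ALL:
            self.all_history.append(command)
            targets = self.interpreters
        else:
            # Simplification: round-robin instead of least-busy selection.
            targets = [self.interpreters[self.next_worker]]
            self.next_worker = (self.next_worker + 1) % len(self.interpreters)
        for namespace in targets:
            exec(command, namespace)

    def new_partition(self):
        # A freshly provisioned instance replays every ALL-mode call.
        return ToyOperator(len(self.interpreters), replay=self.all_history)


op = ToyOperator(workers=2)
op.run("scale = 10", WorkerMode.ALL)    # shared setup, must exist everywhere
op.run("scratch = 1", WorkerMode.ANY)   # per-tuple work, one worker only
clone = op.new_partition()
has_scale = ["scale" in ns for ns in clone.interpreters]
has_scratch = ["scratch" in ns for ns in clone.interpreters]
original_scratch = ["scratch" in ns for ns in op.interpreters]
```
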
