lucene-pylucene-dev mailing list archives

From Roman Chyla <roman.ch...@gmail.com>
Subject Re: call python from java - what strategy do you use?
Date Wed, 12 Jan 2011 10:05:53 GMT
Hi Andi,

I think I will give it a try, if only because I am curious. Please see
one remaining question below.


On Tue, Jan 11, 2011 at 10:37 PM, Andi Vajda <vajda@apache.org> wrote:
>
>
> On Tue, 11 Jan 2011, Roman Chyla wrote:
>
>> Hi Andi,
>>
>> This is much more than I could have hoped for! Just yesterday, I was
>> looking for ways to embed a Python VM in Jetty, as that would be
>> more natural, but I found only jepp.sourceforge.net, and the need to
>> compile it against a newly built Python was off-putting. I could
>> not ask that of the people who may need my extension. And I realize
>> only now that embedding Python in Java is even documented on the
>> website, but honestly I would not know how to do it without your
>> detailed examples.
>>
>> Now to the questions. I apologize; some or all of them must seem very
>> stupid to you.
>>
>> - PyLucene is used on many platforms and, with JCC, has always worked as
>> expected (I love it!), but is it as reliable in the opposite
>> direction? PythonVM.java loads the "jcc" library, so I wonder whether in
>> principle the direction makes any difference, but I am not
>> sure. To rephrase my convoluted question: would you expect this
>> wrapping to be as reliable as wrapping Java inside Python is now?
>
> I've been using this for over two years, in production.
> My main worry was memory leaks because a server process is expected to stay
> up and running for weeks at a time and it's been very stable on that front
> too. Of course, when there is a bug somewhere that causes your Python VM to
> crash, the entire server crashes. Just like when the JVM crashes (which is
> normally rare). In other words, this isn't any less reliable than a
> standalone Python VM process. It can be tricky, but is possible, to run gdb,
> pdb and jdb together to step through the three languages involved, python,
> java and C++. I've had to do this a few times but not in a long time.
>
>> - In the past, I built JCC libraries on one host and distributed them to
>> various machines. As long as the OS family and the Python major version were
>> the same, it worked on Win/Lin/Mac just fine. As far as I can tell, this does
>> not change, or will it depend on the Python against which the egg was
>> built?
>
> Distributing binaries is risky. The same caveats apply. I wouldn't do it,
> even in the simple PyLucene case.

Unfortunately, I don't have many choices left. This is not a
client-software scenario: we are running the jobs on the grid,
and there I cannot compile the binaries. So, even if the
location of the Python interpreter or the Python minor version did not
cause problems previously, perhaps it will be different now. But that
does not concern Solr; the Solr wrapper is not meant for the grid.

>
>> - now a little tricky issue; when I wrap jetty inside python, I hoped
>> to build it in a shared mode with lucene to be able to do some
>> low-level lucene indexing tasks from inside Python. If I do the
>> opposite and wrap Python VM in Java, I would still like to access the
>> lucene (which is possible, as I see well from your examples) But on
>> the python side, you are calling initVM() - will the initVM() call
>> create a new Java VM or will it access the parent Java VM which
>> started it?
>
> No, initVM() in this case just initializes your egg and adds its stuff to
> the CLASSPATH. No Java VM init is done. As with any shared-mode JCC-built
> extension, all calls to initVM() but the first one just do that.
> The first call to initVM() in the embedding Python case is like that too
> because there already is a Java VM running when PythonVM is instantiated and
> called.

And if, in Python, I do:

import lucene
lucene.initVM(lucene.CLASSPATH)

will it work in this case, giving access to the Java classes from
inside Python? Or will I have to forget PyLucene and prepare some
extra Java classes (the "JCC in reverse" trick, as you put it)?

>
>> - you say that threads are not managed by the Python VM, does that
>> mean there is no Python GIL?
>
> No, there is a Python GIL (and that is the Achilles' heel of this setup if
> you expect high concurrent servlet performance from your server calling
> Python). That Python GIL is connected to the thread state I was mentioning
> earlier. Because the thread is not managed by Python, when Python is called
> (by way of the code generated by JCC) it doesn't find a thread state for the
> thread and creates one. When the call completes, the thread state is
> destroyed because its refcount goes to zero. My TerminatingThread class
> acquires a Python thread state and keeps it for the life of the thread,
> thereby working around this problem.

OK, so this behaves like normal Python, which makes me
less worried :) I wanted to use multiprocessing inside Python to deal
with the GIL, and I see no reason why it should not work in this case.
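
A minimal sketch of that multiprocessing idea in plain Python. The task
(count_primes) and the helper name are made up for illustration, and whether
starting worker processes from a Python VM embedded in a JVM behaves well is
an assumption not tested here:

```python
from multiprocessing import Pool


def count_primes(limit):
    """CPU-bound toy task: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_in_processes(limits, workers=2):
    """Run count_primes for each limit in separate worker processes.

    Each worker is its own interpreter with its own GIL, so the tasks
    do not compete for a single lock.
    """
    with Pool(processes=workers) as pool:
        return pool.map(count_primes, limits)
```

In a standalone script, count_primes_in_processes([10000, 20000]) should be
called under an `if __name__ == "__main__":` guard so that worker processes
can import the module safely.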

Thank you very much.
Cheers,

  roman

>
>> - I don't really know what exactly is in Python's thread-local
>> storage; could it somehow negatively affect the Python process if
>> acquireThreadState/releaseThreadState are not called?
>
> Yes, if you depend on thread-local storage, it would get lost between calls
> and cause confusion and bugs, defeating its purpose. Python's thread-local
> storage support is documented here:
> http://docs.python.org/library/threading.html, look for threading.local.
>
> Andi..
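
A small plain-Python sketch of how threading.local behaves. It does not
involve JCC: a brand-new Python thread only stands in, by analogy, for a call
that arrives with a freshly created thread state, and the names
request_context, remember and recall are made up for illustration:

```python
import threading

# Thread-local container: every thread sees its own set of attributes.
request_context = threading.local()


def remember(value):
    """Store a value in the calling thread's local storage."""
    request_context.value = value


def recall():
    """Return the calling thread's stored value, or None if nothing is stored."""
    return getattr(request_context, "value", None)


def recall_in_new_thread():
    """Look up the value from a fresh thread, which has empty local storage."""
    result = []
    worker = threading.Thread(target=lambda: result.append(recall()))
    worker.start()
    worker.join()
    return result[0]
```

After remember("tid-1"), recall() in the same thread returns the value, while
recall_in_new_thread() finds nothing. Code that relies on such storage
surviving between calls would be confused in the same way.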
>
>>
>> Thank you.
>>
>> Cheers,
>>
>>  roman
>>
>>
>> On Tue, Jan 11, 2011 at 8:13 PM, Andi Vajda <vajda@apache.org> wrote:
>>>
>>>  Hi Roman,
>>>
>>> On Tue, 11 Jan 2011, Roman Chyla wrote:
>>>
>>>> I have recently wrapped Solr inside Jetty with JCC (we need to access
>>>> very big result sets quickly, via JNI, but also keep Solr running as
>>>> normal) and was wondering what strategies you use to talk
>>>> *from inside* Java to the Python end.
>>>>
>>>> So far, I have come up with these:
>>>>
>>>> - raise exceptions in Java and catch them in Python (I think I have seen
>>>> this in some posts from Bill Jansen)
>>>> - communicate via sockets
>>>> - wait passively: call some Java method and wait for it to return
>>>> - monitor actively: poll some Java object in a loop from Python
>>>>
>>>> Is there something else?
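
A minimal sketch of the socket option from the list above, seen from the
Python end. The newline-terminated request/reply format and the names
handle_requests and demo are assumptions for illustration; a socketpair
stands in for the connection that the Java side would open:

```python
import socket
import threading


def handle_requests(conn, handler):
    """Read newline-terminated requests from conn and answer each one."""
    with conn, conn.makefile("rw", encoding="utf-8", newline="\n") as stream:
        for line in stream:
            stream.write(handler(line.rstrip("\n")) + "\n")
            stream.flush()


def demo():
    """Send one request over a local socket pair and return the reply."""
    java_side, python_side = socket.socketpair()
    worker = threading.Thread(target=handle_requests,
                              args=(python_side, str.upper))
    worker.start()
    with java_side, java_side.makefile("rw", encoding="utf-8",
                                       newline="\n") as stream:
        stream.write("status\n")
        stream.flush()
        reply = stream.readline().rstrip("\n")
        # Signal end of requests so the handler loop terminates.
        java_side.shutdown(socket.SHUT_WR)
    worker.join()
    return reply
```

Here str.upper is only a placeholder for whatever Python code should answer
the request.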
>>>
>>> I'm not sure I completely understand your questions, but if what you're
>>> asking is how to run Python code from inside a Java servlet container,
>>> that I've done with Tomcat and Lucene.
>>>
>>> Basically, instead of embedding a JVM inside a Python VM - as is done for
>>> PyLucene - you do the opposite, you embed a Python VM inside a JVM.
>>>
>>> For that purpose, see the org.apache.jcc.PythonVM class available in
>>> JCC's java tree. This class must be instantiated from the main thread at
>>> Java servlet engine startup time. In Tomcat, I patched some startup code,
>>> in Bootstrap.java (see patches below), for this purpose.
>>>
>>> Then, to make some Python code accessible from Java, use the usual way of
>>> writing "extensions", the so-called JCC in reverse trick:
>>>  1. define a Java class with some native methods implemented in Python;
>>>  2. define a Python class that "extends" it;
>>>  3. build the Java class into a JAR;
>>>  4. include it into a JCC-built egg;
>>>  5. install the egg into Python's env (site-packages, PYTHONPATH,
>>>     whatever);
>>>  6. write servlet code in Java that imports your Java class and calls it.
>>>
>>> As you can see, this sounds simple but the devil is in the details. Of
>>> course, bending Jetty for this may have different requirements, but the
>>> code snippets below should give you a good idea about what's required.
>>>
>>> This approach has been in production, running freebase.com's search
>>> server, for over two years now.
>>>
>>> If you have questions, of course, please ask.
>>> Good luck !
>>>
>>> Andi..
>>>
>>> ----------------------
>>> Patch to Bootstrap.java to use JCC's PythonVM (which initializes the
>>> embedded Python VM)
>>>
>>> --- apache-tomcat-6.0.29-src/java/org/apache/catalina/startup/Bootstrap.java    2010-07-19 06:02:32.000000000 -0700
>>> +++ apache-tomcat-6.0.29-src/java/org/apache/catalina/startup/Bootstrap.java.patched    2010-08-04 08:49:05.000000000 -0700
>>> @@ -30,16 +30,18 @@
>>>  import javax.management.MBeanServer;
>>>  import javax.management.MBeanServerFactory;
>>>  import javax.management.ObjectName;
>>>
>>>  import org.apache.catalina.security.SecurityClassLoad;
>>>  import org.apache.juli.logging.Log;
>>>  import org.apache.juli.logging.LogFactory;
>>>
>>> +import org.apache.jcc.PythonVM;
>>> +
>>>
>>>  /**
>>>  * Boostrap loader for Catalina.  This application constructs a class loader
>>>  * for use in loading the Catalina internal classes (by accumulating all of the
>>>  * JAR files found in the "server" directory under "catalina.home"), and
>>>  * starts the regular execution of the container.  The purpose of this
>>>  * roundabout approach is to keep the Catalina internal classes (and any
>>>  * other classes they depend on, such as an XML parser) out of the system
>>> @@ -398,22 +400,24 @@
>>>         try {
>>>             String command = "start";
>>>             if (args.length > 0) {
>>>                 command = args[args.length - 1];
>>>             }
>>>
>>>             if (command.equals("startd")) {
>>>                 args[args.length - 1] = "start";
>>> +                PythonVM.start("mql");
>>>                 daemon.load(args);
>>>                 daemon.start();
>>>             } else if (command.equals("stopd")) {
>>>                 args[args.length - 1] = "stop";
>>>                 daemon.stop();
>>>             } else if (command.equals("start")) {
>>> +                PythonVM.start("mql");
>>>                 daemon.setAwait(true);
>>>                 daemon.load(args);
>>>                 daemon.start();
>>>             } else if (command.equals("stop")) {
>>>                 daemon.stopServer(args);
>>>             } else {
>>>                 log.warn("Bootstrap: command \"" + command + "\" does not exist.");
>>>             }
>>>
>>> -----------------------------------------
>>> Define a Java class:
>>>
>>> package ....
>>>
>>> public class EMQL {
>>>
>>>    private long pythonObject;
>>>
>>>    public EMQL()
>>>    {
>>>    }
>>>
>>>    public void pythonExtension(long pythonObject)
>>>    {
>>>        this.pythonObject = pythonObject;
>>>    }
>>>    public long pythonExtension()
>>>    {
>>>        return this.pythonObject;
>>>    }
>>>
>>>    public void finalize()
>>>        throws Throwable
>>>    {
>>>        pythonDecRef();
>>>    }
>>>
>>>    public native void pythonDecRef();
>>>
>>>    // the methods implemented in python
>>>    public native String init(ME me);
>>>    public native String emql_refresh(String tid, String type);
>>>    public native String emql_status();
>>>
>>>    etc .......... etc
>>>
>>> ------------------------------------
>>> The corresponding Python class
>>>
>>> import ......
>>>
>>> from jemql import initVM, CLASSPATH, EMQL
>>>
>>> initVM(CLASSPATH)
>>>
>>> class emql(EMQL):
>>>
>>>    def __init__(self):
>>>        super(emql, self).__init__()
>>>
>>>    def init(self, me):
>>>     ...........
>>>    def emql_refresh(self, tid, type):
>>>     ...........
>>>    def emql_status(self):
>>>     ...........
>>>       return "some status"
>>>
>>>    etc ...... etc
>>>
>>> ------------------------------------
>>> Makefile rules to build this via JCC (the jemql.egg file is just an empty
>>> target file for Makefile, it's not used for anything else):
>>>
>>> default: jemql.egg
>>>
>>> jemql.jar: java/org/blah/blah/EMQL.java
>>>        mkdir -p classes
>>>        javac -classpath $(CLASSPATH):$(MORE_CLASSPATH):$(etc..etc) -d classes $(JAVAC_FLAGS) $<
>>>        jar -cvf $@ -C classes .
>>>
>>> jemql.egg: jemql.jar $(JMQL_JAR) emql.py
>>>        $(JCC) --version 1.0 --jar $< \
>>>               --classpath $(CLASSPATH):$(JME_JAR):$(JMQL_JAR) \
>>>               org.blah.blah.me.ME \
>>>               --package java.lang \
>>>               --python jemql --build $(DBG_FLAGS) \
>>>               --install \
>>>               --module emql
>>>        touch $@
>>> ------------------------------------
>>> Patch to Tomcat's build.xml ANT script to add JCC's classes (like
>>> PythonVM) to the build classpath.
>>>
>>> --- apache-tomcat-6.0.29-src/build.xml  2010-07-19 06:02:31.000000000 -0700
>>> +++ apache-tomcat-6.0.29-src/build.xml.patched  2010-08-04 09:30:24.000000000 -0700
>>> @@ -95,16 +95,17 @@
>>>   <property name="jasper-jdt.jar" value="${jasper-jdt.home}/jasper-jdt.jar"/>
>>>   <available property="tomcat-dbcp.present" file="${tomcat-dbcp.jar}" />
>>>   <available property="jdk16.present" classname="javax.sql.StatementEvent" />
>>>
>>>   <!-- Classpath -->
>>>   <path id="tomcat.classpath">
>>>     <pathelement location="${ant.jar}"/>
>>>     <pathelement location="${jdt.jar}"/>
>>> +    <pathelement location="${jcc.egg}/jcc/classes"/>
>>>   </path>
>>>
>>>   <!-- Version info filter set -->
>>>   <tstamp>
>>>     <format property="TODAY" pattern="MMM d yyyy" locale="en"/>
>>>     <format property="TSTAMP" pattern="hh:mm:ss"/>
>>>   </tstamp>
>>>   <filterset id="version.filters">
>>> @@ -148,16 +149,25 @@
>>>            excludes="**/CVS/**,**/.svn/**"
>>>            encoding="ISO-8859-1">
>>>  <!-- Comment this in to show unchecked warnings:
>>>       <compilerarg value="-Xlint:unchecked"/>
>>>  -->
>>>       <classpath refid="tomcat.classpath" />
>>>       <exclude name="org/apache/naming/factory/webservices/**" />
>>>     </javac>
>>> +    <javac srcdir="${extras.path}" destdir="${tomcat.classes}"
>>> +           debug="${compile.debug}"
>>> +           deprecation="${compile.deprecation}"
>>> +           source="${compile.source}"
>>> +           optimize="${compile.optimize}"
>>> +           excludes="**/CVS/**,**/.svn/**">
>>> +<!-- Comment this in to show unchecked warnings:     <compilerarg value="-Xlint:unchecked"/> -->
>>> +      <classpath refid="tomcat.classpath" />
>>> +    </javac>
>>>     <!-- Copy static resource files -->
>>>     <copy todir="${tomcat.classes}" encoding="ISO-8859-1">
>>>       <filterset refid="version.filters"/>
>>>       <fileset dir="java">
>>>         <include name="**/*.properties"/>
>>>         <include name="**/*.dtd"/>
>>>         <include name="**/*.tasks"/>
>>>         <include name="**/*.xsd"/>
>>>
>>> -----------------------------------------------
>>> Patch to catalina.sh, the Tomcat startup script, to add JCC to LIBPATH and
>>> CLASSPATH
>>>
>>> --- apache-tomcat-6.0.29-src/output/build/bin/catalina.sh    2010-08-04 09:57:27.000000000 -0700
>>> +++ apache-tomcat-6.0.29-src/output/build/bin/catalina.sh.patched    2010-08-04 09:57:47.000000000 -0700
>>> @@ -162,16 +162,30 @@
>>>     exit 1
>>>   fi
>>>  fi
>>>
>>>  if [ -z "$CATALINA_BASE" ] ; then
>>>   CATALINA_BASE="$CATALINA_HOME"
>>>  fi
>>>
>>> +if [ -n "$JCC_EGG" ]; then
>>> +  CLASSPATH="$CLASSPATH":"$JCC_EGG"/jcc/classes
>>> +  JAVA_LIB_PATH=$JCC_EGG
>>> +fi
>>> +if [ -n "$TOMCAT_APR_LIB_PATH" ]; then
>>> +  JAVA_LIB_PATH=$JAVA_LIB_PATH:$TOMCAT_APR_LIB_PATH
>>> +fi
>>> +if [ -n "$JAVA_LIB_PATH" ]; then
>>> +  JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$JAVA_LIB_PATH"
>>> +fi
>>> +if [ -n "$EXTRA_CLASSPATH" ]; then
>>> +  CLASSPATH="$CLASSPATH":"$EXTRA_CLASSPATH"
>>> +fi
>>> +
>>>  # Add tomcat-juli.jar and bootstrap.jar to classpath
>>>  # tomcat-juli.jar can be over-ridden per instance
>>>  if [ ! -z "$CLASSPATH" ] ; then
>>>   CLASSPATH="$CLASSPATH":
>>>  fi
>>>  if [ "$CATALINA_BASE" != "$CATALINA_HOME" ] && [ -r "$CATALINA_BASE/bin/tomcat-juli.jar" ] ; then
>>>   CLASSPATH="$CLASSPATH""$CATALINA_BASE"/bin/tomcat-juli.jar:"$CATALINA_HOME"/bin/bootstrap.jar
>>>  else
>>>
>>> These EGG paths are long, complicated and OS-specific; the trick below
>>> generates them programmatically (from inside a Makefile):
>>>
>>> JCC_EGG:=$(shell $(PYTHON) -c "import os, jcc; print os.path.dirname(os.path.dirname(jcc.__file__))")
>>> JEMQL_EGG:=$(shell $(PYTHON) -c "import os, jemql; print os.path.dirname(os.path.dirname(jemql.__file__))")
>>>
>>> Then, the CLASSPATH addition during _build_ time:
>>>  CLASSPATH = $(CLASSPATH):$(JEMQL_EGG)/jemql/jemql.jar
>>> and so on...
>>> At runtime, JCC takes care of adding your eggs to the startup CLASSPATH.
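
The same path computation as the Makefile one-liners, written as a small
Python 3 function. This is a sketch: the helper name egg_root is made up, and
it assumes the module is a package whose parent directory is the egg (as with
jcc and jemql above):

```python
import importlib
import os


def egg_root(module_name):
    """Return the directory two levels above a package's __init__ file.

    For a package installed as .../Foo.egg/foo/__init__.py this is the
    .../Foo.egg directory.
    """
    module = importlib.import_module(module_name)
    return os.path.dirname(os.path.dirname(os.path.abspath(module.__file__)))
```

For example, egg_root("jcc") would replace the first one-liner, provided jcc
is importable by the interpreter running the code.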
>>>
>>> ----------------------------------------------
>>> Last but not least, if you use Python's thread-local storage in your
>>> threads, note that Python threads embedded inside a JVM are 'dummy'
>>> threads: while they're backed by the actual Java thread (a pthread), the
>>> Python VM is not managing them, so a thread state object is created each
>>> and every time a Python thread is entered and released when it exits back
>>> to the JVM. This has two problems:
>>>  1. it's a bit wasteful
>>>  2. python thread local storage gets lost
>>>
>>> The Java class below works around this by incrementing the refcount that
>>> controls this:
>>>
>>> package org.apache.catalina.core;
>>>
>>> import org.apache.jcc.PythonVM;
>>>
>>> public class TerminatingThread extends Thread {
>>>    protected Runnable runnable;
>>>
>>>    public TerminatingThread(ThreadGroup group, Runnable runnable, String name)
>>>    {
>>>        super(group, name);
>>>        this.runnable = runnable;
>>>    }
>>>
>>>    public void run()
>>>    {
>>>        PythonVM vm = PythonVM.get();
>>>
>>>        try {
>>>            vm.acquireThreadState();
>>>            runnable.run();
>>>        } finally {
>>>            vm.releaseThreadState();
>>>        }
>>>    }
>>> }
>>>
>>> Then, there is some trickery to get Tomcat to use this class for its
>>> threads instead of the default one:
>>>
>>> --- apache-tomcat-6.0.29-src/java/org/apache/catalina/core/StandardThreadExecutor.java  2010-07-19 06:02:32.000000000 -0700
>>> +++ apache-tomcat-6.0.29-src/java/org/apache/catalina/core/StandardThreadExecutor.java.patched  2010-08-04 08:56:02.000000000 -0700
>>> @@ -44,17 +44,17 @@
>>>     protected int minSpareThreads = 25;
>>>
>>>     protected int maxIdleTime = 60000;
>>>
>>>     protected ThreadPoolExecutor executor = null;
>>>
>>>     protected String name;
>>>
>>> -    private LifecycleSupport lifecycle = new LifecycleSupport(this);
>>> +    protected LifecycleSupport lifecycle = new LifecycleSupport(this);
>>>     // ---------------------------------------------- Constructors
>>>     public StandardThreadExecutor() {
>>>         //empty constructor for the digester
>>>     }
>>>
>>>
>>>
>>>     // ---------------------------------------------- Public Methods
>>>
>>>
>>> In Tomcat's server.xml, use this executor (and code below for it)
>>>    <Executor name="relThreadPool"
>>>              className="org.apache.catalina.core.TerminatingThreadExecutor"
>>>              namePrefix="rel-exec-"
>>>              maxIdleTime="3600000"
>>>              minSpareThreads="2"
>>>              maxThreads="2" />
>>>
>>>
>>> package org.apache.catalina.core;
>>>
>>> import java.util.concurrent.ThreadPoolExecutor;
>>> import java.util.concurrent.TimeUnit;
>>> import org.apache.catalina.LifecycleException;
>>>
>>>
>>> public class TerminatingThreadExecutor extends StandardThreadExecutor {
>>>
>>>    public void start()
>>>        throws LifecycleException
>>>    {
>>>        lifecycle.fireLifecycleEvent(BEFORE_START_EVENT, null);
>>>
>>>        TaskQueue taskqueue = new TaskQueue();
>>>        TaskThreadFactory tf = new TerminatingTaskThreadFactory(namePrefix);
>>>
>>>        lifecycle.fireLifecycleEvent(START_EVENT, null);
>>>        executor = new ThreadPoolExecutor(getMinSpareThreads(), getMaxThreads(),
>>>                                          maxIdleTime, TimeUnit.MILLISECONDS,
>>>                                          taskqueue, tf);
>>>        taskqueue.setParent(executor);
>>>        lifecycle.fireLifecycleEvent(AFTER_START_EVENT, null);
>>>    }
>>>
>>>    protected class TerminatingTaskThreadFactory
>>>        extends StandardThreadExecutor.TaskThreadFactory {
>>>
>>>        protected TerminatingTaskThreadFactory(String namePrefix)
>>>        {
>>>            super(namePrefix);
>>>        }
>>>
>>>        public Thread newThread(Runnable runnable)
>>>        {
>>>            Thread t = new TerminatingThread(group, runnable,
>>>                                             namePrefix + threadNumber.getAndIncrement());
>>>
>>>            t.setDaemon(daemon);
>>>            t.setPriority(getThreadPriority());
>>>
>>>            return t;
>>>        }
>>>    }
>>> }
>>>
>
