Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: <derby-dev@db.apache.org>
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <4F33C39A.1070003@oracle.com>
Date: Thu, 09 Feb 2012 14:01:14 +0100
From: Kristian Waagan <kristian.waagan@oracle.com>
Organization: Oracle Corporation
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1
MIME-Version: 1.0
To: derby-dev@db.apache.org
Subject: Re: SpawnedProcess arguments and behavior
References: <4F33A1E4.7040201@oracle.com> <wjo84nv0qkpu.fsf@oracle.com>
In-Reply-To: <wjo84nv0qkpu.fsf@oracle.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

On 09.02.2012 13:01, Knut Anders Hatlen wrote:
> Kristian Waagan<kristian.waagan@oracle.com>  writes:
>
>> Hi,
>>
>> I've been looking a bit at SpawnedProcess, and I'm planning to do some
>> changes to it. The most important change is to make
>> BaseTestCase.readProcessOutput use the class, since reading the output
>> from the subprocess requires extra code that should be isolated to one
>> location. There is reason to believe a problem with readProcessOutput
>> is the cause of the interrupt-related errors reported recently by
>> Myrna and, possibly, Kathey.
>>
>> What's troubling me are the arguments destroy and timeout, especially
>> the combination of the two.
>> For me, a timeout implies destroy == true. Specifying a timeout and
>> setting destroy to false is effectively the same as setting destroy to
>> true, since destroy will be forced to true when a timeout occurs.
> Agreed. I think the use case is to be able to forcefully quit a process
> immediately (it's used this way only in NetworkServerTestSetup, I
> think). We probably need to preserve that functionality, but it's
> probably less confusing if we have one method for immediate destruction
> (with no parameters) and one with a timeout (and no destroy parameter).

Yes, I added a destroy-method for this.
In this case we suspect that something may be wrong, so we know up front 
that we want to destroy the process if it doesn't terminate normally 
reasonably fast.

>> For automated test runs it would be best if complete() always returns,
>> although many test frameworks have mechanisms to kill the main process
>> if it takes too long. For debugging it may be best to keep the
>> subprocess running and the main process hanging to allow for
>> inspection. I think it should be possible to obtain the stack (java
>> stack or native stack) of the subprocess, then kill it manually to get
>> stdout/stderr and have the main process continue.
>>
>> I'd prefer to settle on one of two approaches, since that would
>> simplify the code and define a consistent behavior:
>>   a) Never destroy the process.
>>   b) Always destroy the process if hanging for more than a default
>> amount of time.
>>
>> Opinions?
> Option a is of course the easier one to implement.

Yes. Option b can be implemented with a timer-task.
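
Roughly, option b could look like the sketch below. This is only an 
illustration: TimedCompletion and its timeoutMillis parameter are made-up 
names, not existing Derby test code. A java.util.Timer task destroys the 
process if waitFor() has not returned within the timeout.

```java
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical helper, not part of Derby's test framework. */
final class TimedCompletion {

    /**
     * Waits for the process to exit. A timer task destroys the process if
     * it is still running after timeoutMillis. Returns the exit code.
     */
    static int complete(final Process process, long timeoutMillis)
            throws InterruptedException {
        Timer timer = new Timer("process-killer", true); // daemon thread
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                // Runs only if the timer was not cancelled in time.
                process.destroy();
            }
        }, timeoutMillis);
        try {
            // Returns when the process exits, normally or via destroy().
            return process.waitFor();
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) throws Exception {
        // A short-lived JVM stands in for a real test subprocess. Its
        // output is tiny, so it is not drained here; real code must read
        // stdout/stderr to avoid blocking the subprocess.
        String java = System.getProperty("java.home") + "/bin/java";
        Process p = new ProcessBuilder(java, "-version")
                .redirectErrorStream(true).start();
        System.out.println("exit code: " + complete(p, 10000L));
    }
}
```

The timer is cancelled in the finally block, so a process that terminates 
in time is never touched.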

> Is it possible to get
> the stack of the sub-process in a portable way with option b?

I'd say no. Note the assumption that the process is a Java process - I 
believe this holds for our use for testing Derby.
Since this must work across operating systems, using kill etc. is a no-go. 
There's JMX, but that is way too complicated for this, I think. That 
leaves me with jps and jstack, but that sounds fragile at best...

>
> If I understand correctly, the suggestion is to always have a timeout
> when calling complete(), right? That sounds reasonable to me, provided
> that the timeout is high enough to avoid errors when the termination of
> the sub-process just happens to be slow.

Yes, that's option b. I was thinking of a timeout in the range 2 - 15 
minutes.

>
> However, I think most of the times we've seen hangs involving
> sub-processes, they've been caused by some kind of deadlock in the
> communication between the main test process and the sub-process
> (typically both processes waiting for output from the other one). In
> those cases, the test never gets as far as calling complete(), and a
> timeout in complete() wouldn't help.
>
> To address those cases, SpawnedProcess might need a timeout mechanism
> that automatically destroys the process if it has lived too long. But
> then the default timeout must be very high, since it must account for
> the time it takes to run the test case, not just the time it takes to
> shut down the process after completion of the test, and we don't want the
> timeout to cause problems on slow machines.

This is definitely taking things a step further :)

I think this can also be done using a timer-task. As you note, the 
difficult thing is to get the timeout right. Again, a reasonable default 
timeout may be sufficient. Controlling this for the 
NetworkServerControlTestSetup may be a bit more of a hassle (unless 
you're okay with having loads of arguments in the method signatures), 
but since this is a safety-net feature a default timeout of 45 - 60 
minutes would hopefully be enough...


Are we looking at something like this? (Y much smaller than X)
  o new SpawnedProcess(...)
     creates a watchdog thread that kills the process after X minutes
  o complete()
     waits Y minutes for the process to terminate normally, then kills
     it. Should return soon after the process terminates.
     (active wait with sleep, or waitFor + a separate thread for killing
     the process)
  o destroy()
     kills the process immediately
     (not sure if this is really needed, or if complete() will suffice)
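
To make the outline above concrete, here is a rough sketch using the 
"active wait with sleep" variant. The class name WatchedProcess, the 
parameter names and the 50 ms poll interval are assumptions for 
illustration only, not the actual SpawnedProcess code.

```java
/** Illustrative sketch only; names and timeouts are assumptions. */
final class WatchedProcess {
    private final Process process;
    private final Thread watchdog;

    /** Starts a watchdog that kills the process after maxLifetimeMillis (X). */
    WatchedProcess(final Process process, final long maxLifetimeMillis) {
        this.process = process;
        this.watchdog = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    Thread.sleep(maxLifetimeMillis);
                    process.destroy(); // lived too long
                } catch (InterruptedException e) {
                    // Cancelled by complete() or destroy(); nothing to do.
                }
            }
        }, "process-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
    }

    /**
     * Waits up to waitMillis (Y) for normal termination by polling
     * exitValue(), then kills the process. Returns the exit code.
     */
    int complete(long waitMillis) throws InterruptedException {
        long deadline = System.nanoTime() + waitMillis * 1000000L;
        while (System.nanoTime() - deadline < 0) {
            try {
                int code = process.exitValue();
                watchdog.interrupt();
                return code;
            } catch (IllegalThreadStateException stillRunning) {
                Thread.sleep(50L);
            }
        }
        return destroy();
    }

    /** Kills the process immediately and returns its exit code. */
    int destroy() throws InterruptedException {
        watchdog.interrupt();
        process.destroy();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // A short-lived JVM stands in for a real test subprocess.
        String java = System.getProperty("java.home") + "/bin/java";
        Process p = new ProcessBuilder(java, "-version")
                .redirectErrorStream(true).start();
        WatchedProcess wp = new WatchedProcess(p, 60000L);
        System.out.println("exit code: " + wp.complete(10000L));
    }
}
```

In this sketch complete() falls back to destroy() when Y expires, which 
suggests a separate public destroy() is only needed for the "something is 
already wrong" case.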


-- 
Kristian
