incubator-olio-user mailing list archives

From Joshua Schnee <jpsch...@gmail.com>
Subject Re: OLIO-Java harness runs forever - round 2
Date Tue, 09 Feb 2010 23:11:17 GMT
Akara,

I had one of our lead Java developers look over your response. Can you review
his comments below?


Thanks,

Joshua


His comments:

  > The code setting the timeout is in TimedSocket.connect().

But the point in the stack trace that appears, at least to my eyes, to be the
source of the read is the UIDriver class, and that class uses URLConnections.
So I don't see how a call to the methods in UIDriver gets any such timeout:
with URLConnections you neither explicitly create sockets nor explicitly call
connect().
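
To make this concrete, here is a minimal sketch of the point (the URL is a
placeholder, and this is illustrative code, not anything from Faban or Olio).
A URLConnection's connect and read timeouts both default to 0, meaning wait
forever, so unless the driver code calls the two setters below, a blocked
read on the connection's stream has no 30-second limit:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    public class UrlConnectionTimeoutSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; stands in for whatever UIDriver requests.
            URLConnection conn =
                new URL("http://webserver/home.html").openConnection();
            // Both timeouts default to 0 == infinite. Without these two
            // calls, a read on the stream below can block forever:
            conn.setConnectTimeout(30000); // 30 second connect timeout
            conn.setReadTimeout(30000);    // 30 second read timeout
            InputStream in = conn.getInputStream();
            while (in.read() != -1) {
                // consume the response; a stalled server now produces a
                // SocketTimeoutException instead of an indefinite block
            }
            in.close();
        }
    }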

Now it is possible that somewhere several layers back, outside the scope of
these stack traces, *something* is creating such a connection with a timeout,
and within that connection this URLConnection is created without one, so that
even if the blocking read on the URLConnection would wait forever, the method
that called it won't. But these stack traces provide no hint of such a
relationship.

> The only way a thread can block on the I/O forever is if
> both of these conditions are true:
>
> 1. The socket does not obey setSoTimeout
> 2. The socket I/O interrupt does not work. A thread.interrupt
>    does not cause the waiting read call to get interrupted.

or

3. You are not using this connect() method to make the connection.
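
For reference, here is a small sketch of the behavior conditions 1 and 2
assume (host and port are placeholders, not from the benchmark config): a
read on a socket with setSoTimeout(30000) should throw SocketTimeoutException
after roughly 30 seconds, and for plain java.net sockets Thread.interrupt()
alone does not unblock a read already in progress.

    import java.io.InputStream;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class SoTimeoutSketch {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket("webserver", 8080); // placeholder host/port
            s.setSoTimeout(30000); // condition 1: read gives up after ~30s
            InputStream in = s.getInputStream();
            try {
                in.read(); // blocks at most ~30s if setSoTimeout is honored
            } catch (SocketTimeoutException ste) {
                // expected when the server accepts but never responds
            }
            // Condition 2: Thread.interrupt() on a thread blocked in
            // in.read() does not by itself unblock a plain java.net socket
            // read; closing the socket from another thread is what usually
            // forces the read out.
            s.close();
        }
    }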

Lastly, I noticed there is logging code in the Faban harness
(logging.finer(), etc.). I haven't found how to turn it on, but you might
want to ask him; logging output would be a lot more helpful than these stack
traces for working backwards to figure out what is going on.
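
If the harness uses java.util.logging, which the logging.finer() calls
suggest, then a logging.properties file along these lines should turn the
FINER output on. The logger name below is a guess at the package, not
confirmed against the Faban source:

    # Hypothetical logging.properties; "com.sun.faban" is an assumed
    # logger name -- adjust it to whatever Logger names the harness uses.
    handlers = java.util.logging.ConsoleHandler
    java.util.logging.ConsoleHandler.level = FINER
    com.sun.faban.level = FINER

It would be passed to the JVM with
-Djava.util.logging.config.file=logging.properties.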

On Tue, Feb 9, 2010 at 11:37 AM, Akara Sucharitakul <
Akara.Sucharitakul@sun.com> wrote:

> We explicitly do this on the sockets right after the connect:
>
>        setSoTimeout(30000); // 30 second socket read timeout.
>
> Moreover, the master schedules a reaper that goes and calls terminate on
> each agent 60 seconds after the end of ramp down. The terminate will call
> interrupt on all driver threads that are still alive.
>
> If you kill the run, the rampdown won't be observed. It will go and
> interrupt the threads right away.
>
> The code setting the timeout is in TimedSocket.connect(). Here it is:
>
>    public void connect(SocketAddress endpoint, int timeout) throws
> IOException {
>        // Here we intercept the connect and capture the start time.
>        DriverContext ctx = DriverContext.getContext();
>        if (ctx != null)
>            ctx.recordStartTime();
>        if (timeout <= 0)
>            timeout = 30000; // 30 second connect timeout.
>        super.connect(endpoint, timeout);
>        setSoTimeout(30000); // 30 second socket read timeout.
>    }
>
> Note that the read timeout of 30 seconds should be set no matter what
> connect timeout is set.
>
> As for the 2 minute limit at the end, search the MasterImpl for "killAtEnd"
> and you'll see it calls the agent's terminate method at that end time. If
> you trace it down, you'll see that all threads waiting for socket read will
> get interrupted and terminated.
>
> The only way a thread can block on the I/O forever is if both of these
> conditions are true:
>
> 1. The socket does not obey setSoTimeout
> 2. The socket I/O interrupt does not work. A thread.interrupt does not
> cause the waiting read call to get interrupted.
>
>
> Thanks,
> -Akara
>
>
>
> Joshua Schnee wrote:
>
>> Akara,
>>
>> We are still experiencing this issue, across several different physical
>> clients (3 now).  I've asked one of our lead Java developers to review the
>> stack traces, and he agrees with your assessment that the harness, master,
>> and driver are all hung waiting for the driver to terminate.  However, he
>> was somewhat confused by the claim that Faban will time out an attempted
>> read after 30 seconds.  He believes this to be a blocking read call, one
>> that will wait forever for data to be returned unless it is interrupted
>> somehow.
>>
>> He suggested putting in a 30-second read timeout on the threads at the end
>> of the rampdown to stop the harness from running forever, but was also
>> concerned about the run never reaching the end of the rampdown period.  Does
>> Faban still do the rampdown phase if a run is being aborted?
>>
>> Also, could you possibly tell me what block of code in Faban enforces the
>> timeout constraint?  He thought he might be able to work backwards from
>> there...
>>
>> Thanks,
>> Joshua
>>
>> On Tue, Feb 2, 2010 at 11:53 AM, Joshua Schnee <jpschnee@gmail.com> wrote:
>>
>>    Akara,
>>
>>    We don't know of anything that is specific to our configuration, but
>>    one thought is that it has something to do with our running it under
>>    a Windows client.  We'll dig a bit deeper, but if you have any
>>    ideas, they would also be appreciated.
>>
>>    Thanks,
>>    Joshua
>>
>>
>>    On Tue, Feb 2, 2010 at 10:30 AM, Akara Sucharitakul
>>    <Akara.Sucharitakul@sun.com> wrote:
>>
>>        Josh,
>>
>>        I've checked the stacks and it turns out all driver threads are
>>        waiting on socket read (reading from the server). Faban has a
>>        default read timeout of 30 seconds and interrupts the thread and
>>        therefore the I/O at 2 minutes after the rampdown has ended. So
>>        it should not run forever.
>>
>>        Is there anything in your configuration that causes the socket
>>        not to time out according to contract, or to be non-interruptible?
>>        Otherwise I can't see how it could run forever.
>>
>>        Here are some details:
>>
>>        1. pid 608 seems to be an old stale instance of the harness that
>>        does not want to terminate. Please kill it off.
>>        2. pid 2504 is a driver agent. All threads are waiting on socket
>>        read.
>>        3. pid 4508 is the master. It is waiting for the driver agent to
>>        finish up the threads (in a join).
>>        4. pid 4080 is the harness. It is waiting for the master to
>>        terminate.
>>
>>
>>        Thanks,
>>        -Akara
>>
>>        Joshua Schnee wrote:
>>
>>            Turns out I didn't get as much info as is probably needed.
>>            Here's an updated zip file with more dumps...
>>
>>            Thanks,
>>
>>
>>            On Mon, Feb 1, 2010 at 5:45 PM, Joshua Schnee
>>            <jpschnee@gmail.com> wrote:
>>
>>               Just realized the list didn't get copied.
>>
>>
>>               ---------- Forwarded message ----------
>>               From: Joshua Schnee <jpschnee@gmail.com>
>>               Date: Mon, Feb 1, 2010 at 1:46 PM
>>               Subject: Re: OLIO-Java harness runs forever - round 2
>>               To: Akara Sucharitakul <Akara.Sucharitakul@sun.com>
>>
>>
>>               Thanks,
>>
>>               Here are two PIDs I dumped; let me know if these aren't
>>               the ones you want, as there are several.
>>
>>               Thanks,
>>               Joshua
>>
>>
>>               On Mon, Feb 1, 2010 at 1:12 PM, Akara Sucharitakul
>>               <Akara.Sucharitakul@sun.com> wrote:
>>
>>                   kill -QUIT <pid>
>>
>>                   or
>>
>>                   $JAVA_HOME/bin/jstack <pid>
>>
>>                   -Akara
>>
>>                   Joshua Schnee wrote:
>>
>>                       Can you tell me how to do this?  I'm not sure how
>>                       to do this when the harness doesn't hit an
>>                       exception...
>>
>>                       Thanks,
>>                       Joshua
>>
>>                       On Mon, Feb 1, 2010 at 12:17 PM, Akara Sucharitakul
>>                       <Akara.Sucharitakul@sun.com> wrote:
>>
>>                          Can you please get me a stack dump of the Faban
>>                          harness?  If it hangs somewhere, we'll be able
>>                          to diagnose better.  Thanks.
>>
>>                          -Akara
>>
>>
>>                          Joshua Schnee wrote:
>>
>>                              OK, so I'm seeing the harness just run
>>                              indefinitely again, and this time it isn't
>>                              related to the maximum open files.
>>
>>                              Details and background information:
>>                              Faban 1.0 build 111109
>>                              Olio Java 0.2
>>                              The workload is being driven from a physical
>>                              Windows client against 2 VMs on a system that
>>                              is doing many other tasks.
>>
>>                              Usage info:
>>                              Physical client usage: avg ~22.93%, max ~42.15%
>>                              Total system-under-test utilization: ~95%
>>                              Web avg util = 5.3%, max = 40.83% (during
>>                              manual reload)
>>                              DB avg util = 4.2%, max = 55.6% (during
>>                              auto reload)
>>
>>                              Granted, the system under test is near
>>                              saturation, but the client isn't.  I'm not
>>                              sure why the harness is never exiting.  Even
>>                              if the VMs, or the whole system under test,
>>                              get so saturated that they can't respond to
>>                              requests, shouldn't the test, which is running
>>                              on the under-utilized client, finish
>>                              regardless, reporting whatever results it can?
>>                              Shanti, you had previously asked me to file a
>>                              JIRA on this, which I forgot to do.  I can do
>>                              so now, if you'd like.  Finally, glassfish
>>                              appears to be stuck: it's running, but not
>>                              responding to requests, probably due to the
>>                              SEVERE entry in the server.log file (see
>>                              below).
>>
>>                              Faban\Logs:
>>                               agent.log = empty
>>                               cmdagent.log = No issues
>>                               faban.log.xml = No issues
>>
>>                              Master\Logs:
>>                               catalina.out = No issues
>>                               localhost*log*txt = No issues
>>                               OlioDriver.3C\log.xml : Two types of issues,
>>                               doTagSearch and the catastrophic "Forcefully
>>                               terminating benchmark run" : attached
>>
>>                              GlassFish logs:
>>                               jvm.log : Numerous dependency_failed
>>                               entries : attached
>>                               server.log : Several SEVERE entries, most
>>                               notably one where "a signal was attempted
>>                               before wait()" : attached
>>                              Any help resolving this would be much
>>                              appreciated,
>>                              -Josh
>>
>> --
>> -Josh
>>
>>
>


-- 
-Josh
