incubator-olio-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Akara.Sucharita...@Sun.COM
Subject Re: OLIO-Java harness runs forever - round 2
Date Tue, 09 Feb 2010 23:45:25 GMT
On 02/09/10 15:11, Joshua Schnee wrote:
> Akara,
> 
> Can you review his comments?
> 
> 
> Thanks,
> 
> Joshua
> 
> 
> His comments:
> 
>   > The code setting the timeout is in TimedSocket.connect().
> 
> But the point in the stack trace that at least appears to my eyes to be 
> the source of the read is the UIDriver class, and that class is using 
> URLConnections. So I don't see how a call to the methods in UIDriver has 
> any such timeout. You don't explicitly create sockets nor do you 
> explicitly call connect() with URLConnections.

What source are you looking at? Is this Olio 0.2? The UIDriver is 
several layers away from the socket.
In Olio 0.2, all calls from UIDriver go through ApacheHC3Transport, 
which then goes through the Apache HttpClient stack and finally arrives 
at the socket layer. The UIDriver methods do not set or read any timeouts.

> 
> Now it is possible that somewhere several layers back, outside the scope 
> of these stack traces, *something* is creating a such a connection with 
> a timeout and within that connection this URLConnection is created 
> without a timeout, so that effectively even if the blocking read on the 
> URLConnection will wait forever, the method that called it won't. But 
> these stack traces provide no hint of such a relationship.

Each simulated client is responsible for its own connection. It does not 
take a connection from somewhere else. I don't see how this could even 
happen. But to be really sure, we'll need to make all the connect calls 
expose state.

> 
>>  The only way a thread can block on the I/O forever is if
> 
>>  both of these conditions are true:
>>
>>  1. The socket does not obey setSoTimeout
>>  2. The socket I/O interrupt does not work. A thread.interrupt
> 
>>     does not cause the waiting read call to get interrupted.
> 
> or
> 
> 3. You are not using the this connect() method to make the connection.

If you check the Java socket library source, it will always use this 
connect call to make the connection, no matter which Socket constructor 
you use - of course, unless you have a strange JVM

> 
> Lastly, I noticed there is "logging" code in the faban harness 
> (logging.finer(), etc). I haven't found how you can turn it on, but you 
> might want to ask this guy, as logging output would be a lot more 
> helpful than these stack traces for backtracking to help figure out what 
> is going on.

On the system you want to make it log, go to 
$FABAN/config/logging.properties and change or add logging for the class 
you want to see more detailed log messages. Add a property

<name>.level=<desired level>
i.e.
com.sun.faban.driver.level=FINEST

Logs are helpful. But if you have something hanging, stack dumps are better.

-Akara



> 
> 
> On Tue, Feb 9, 2010 at 11:37 AM, Akara Sucharitakul 
> <Akara.Sucharitakul@sun.com <mailto:Akara.Sucharitakul@sun.com>> wrote:
> 
>     We explicitly do this on the sockets right after the connect:
> 
>            setSoTimeout(30000); // 30 second socket read timeout.
> 
>     Moreover, the master schedules a reaper that goes and calls
>     terminate on each agent 60 seconds after the end of ramp down. The
>     terminate will call interrupt on all driver threads that are still
>     alive.
> 
>     If you kill the run, the rampdown won't be observed. It will go and
>     interrupt the threads right away.
> 
>     The code setting the timeout is in TimedSocket.connect(). Here it is:
> 
>        public void connect(SocketAddress endpoint, int timeout) throws
>     IOException {
>            // Here we intercept the connect and capture the start time.
>            DriverContext ctx = DriverContext.getContext();
>            if (ctx != null)
>                ctx.recordStartTime();
>            if (timeout <= 0)
>                timeout = 30000; // 30 second connect timeout.
>            super.connect(endpoint, timeout);
>            setSoTimeout(30000); // 30 second socket read timeout.
>        }
> 
>     Note that the read timeout of 30 seconds should be set no matter
>     what connect timeout is set.
> 
>     As for the 2 minute limit at the end, search the MasterImpl for
>     "killAtEnd" and you'll see it calls the agent's terminate method at
>     that end time. If you trace it down, you'll see that all threads
>     waiting for socket read will get interrupted and terminated.
> 
>     The only way a thread can block on the I/O forever is if both of
>     these conditions are true:
> 
>     1. The socket does not obey setSoTimeout
>     2. The socket I/O interrupt does not work. A thread.interrupt does
>     not cause the waiting read call to get interrupted.
> 
> 
>     Thanks,
>     -Akara
> 
> 
> 
>     Joshua Schnee wrote:
> 
>         Akara,
> 
>         We are still experiencing this issue, across several different
>         physical clients (3 now).  I've asked one of our lead Java
>         developers to review the stack traces and although he agrees
>         with your assessment of where the harness, master and driver are
>         all hung waiting for the driver to terminate.  He said he was
>         somewhat confused by the claim that Faban will time out an
>         attempted read after 30 seconds.  He believes this to be a
>         blocking read call, one that will wait forever for data to be
>         returned unless it is interrupted somehow.
> 
>         He suggested putting in a 30-second read timeout on the threads
>         at the end of the rampdown to stop the harness from running
>         forever, but was also concerned about the run never reaching the
>         end of the rampdown period.  Does Faban still do the rampdown
>         phase if a run is being aborted?
> 
>         Also, could you possibly tell me what block of code in Faban
>         assures the timeout constraint?  He thought he might be able to
>         work backwards from there...
> 
>         Thanks,
>         Joshua
> 
>         On Tue, Feb 2, 2010 at 11:53 AM, Joshua Schnee
>         <jpschnee@gmail.com <mailto:jpschnee@gmail.com>
>         <mailto:jpschnee@gmail.com <mailto:jpschnee@gmail.com>>> wrote:
> 
>            Akara,
> 
>            We don't know of anything that is specific to our
>         configuration, but
>            one thought is that it has something to do with our running
>         it under
>            a Windows client.  We'll dig a bit deeper, but if you have any
>            ideas, they would also be appreciated.
> 
>            Thanks,
>            Joshua
> 
> 
>            On Tue, Feb 2, 2010 at 10:30 AM, Akara Sucharitakul
>            <Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>
>         <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>>> wrote:
> 
>                Josh,
> 
>                I've checked the stacks and it turns out all driver
>         threads are
>                waiting on socket read (reading from the server). Faban has a
>                default read timeout of 30 seconds and interrupts the
>         thread and
>                therefore the I/O at 2 minutes after the rampdown has
>         ended. So
>                it should not run forever.
> 
>                Is there anything in your configuration that causes the
>         socket
>                not to timeout according to contract or be non-interruptible?
>                Otherwise I couldn't see how it could run forever.
> 
>                Here are some details:
> 
>                1. pid 608 seems to be an old stale instance of the
>         harness that
>                does not want to terminate. Please kill it off.
>                2. pid 2504 is a driver agent. All threads are waiting on
>         socket
>                read.
>                3. pid 4508 is the master. It is waiting for the driver
>         agent to
>                finish up the threads (in a join).
>                4. pid 4080 is the harness. It is waiting for the master to
>                terminate.
> 
> 
>                Thanks,
>                -Akara
> 
>                Joshua Schnee wrote:
> 
>                    Turns out I didn't get as much info as what is probably
>                    needed.  Here's an updated zip file with more dumps...
> 
>                    Thanks,
> 
> 
>                    On Mon, Feb 1, 2010 at 5:45 PM, Joshua Schnee
>                    <jpschnee@gmail.com <mailto:jpschnee@gmail.com>
>         <mailto:jpschnee@gmail.com <mailto:jpschnee@gmail.com>>
>                    <mailto:jpschnee@gmail.com
>         <mailto:jpschnee@gmail.com> <mailto:jpschnee@gmail.com
>         <mailto:jpschnee@gmail.com>>>> wrote:
> 
>                       Just realized the list didn't get copied.
> 
> 
>                       ---------- Forwarded message ----------
>                       From: *Joshua Schnee* <jpschnee@gmail.com
>         <mailto:jpschnee@gmail.com>
>                    <mailto:jpschnee@gmail.com
>         <mailto:jpschnee@gmail.com>> <mailto:jpschnee@gmail.com
>         <mailto:jpschnee@gmail.com>
>                    <mailto:jpschnee@gmail.com <mailto:jpschnee@gmail.com>>>>
>                       Date: Mon, Feb 1, 2010 at 1:46 PM
>                       Subject: Re: OLIO-Java harness runs forever - round 2
>                       To: Akara Sucharitakul <Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>
>                    <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>>
>                       <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>
>                    <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>>>>
> 
> 
>                       Thanks,
> 
>                       Here's two PIDs I dumped, let me know if these
>         aren't the
>                    ones you
>                       want as there are several.
> 
>                       Thanks,
>                       Joshua
> 
> 
>                       On Mon, Feb 1, 2010 at 1:12 PM, Akara Sucharitakul
>                       <Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>
>                    <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>>
>                    <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>
>                    <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>>>> wrote:
> 
>                           kill -QUIT <pid>
> 
>                           or
> 
>                           $JAVA_HOME/bin/jstack <pid>
> 
>                           -Akara
> 
>                           Joshua Schnee wrote:
> 
>                               Can you tell me how to do this?  I'm not
>         sure how
>                    do this
>                               when the harness doesn't hit an exception...
> 
>                               Thanks,
>                               Joshua
> 
>                               On Mon, Feb 1, 2010 at 12:17 PM, Akara
>         Sucharitakul
>                               <Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>
>                    <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>>
>                               <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>
>                    <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>>>
>                               <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>
>                    <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>>
>                               <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>
>                    <mailto:Akara.Sucharitakul@sun.com
>         <mailto:Akara.Sucharitakul@sun.com>>>>> wrote:
> 
>                                  Can you please obtain me a stack dump
>         of the Faban
>                               harness? If it
>                                  hangs somewhere, we'll be able to diagnose
>                    better. Thanks.
> 
>                                  -Akara
> 
> 
>                                  Joshua Schnee wrote:
> 
>                                      OK, so I'm seeing the harness just run
>                    indefinitely
>                               again, and
>                                      this time it isn't related to the
>         maximum
>                    open files.
> 
>                                      Details and Background information:
>                                      Faban 1.0 build 111109
>                                      Olio Java 0.2
>                                      The workload is being driven from
>         physical
>                    Windows client
>                                      against 2 VMs on a system that is doing
>                    many other tasks.
> 
>                                      Usage Info:
>                                      Physical Client usage : Avg ~
>         22.93%, Max
>                    ~ 42.15%
>                                      Total System under test Utilization
>         ~95%
>                                      Web avg util = 5.3%, Max =40.83%
>         (during
>                    manual reload)
>                                      DB   avg uili = 4.2%, Max = 55.6%
>         (during
>                    auto reload)
> 
>                                      Granted, the system under test is near
>                    saturation,
>                               the client
>                                      isn't.  I'm not sure why the
>         harness is never
>                               exiting.  Even if
>                                      the VMs or even the system under
>         test gets so
>                               saturated that
>                                      they can't respond to requests,
>         shouldn't
>                    the test,
>                               which is
>                                      running on the under utilized client
>                    finish regardless,
>                                      reporting whatever results it can?
>                     Shanti, you had
>                               previously
>                                      asked me to file a JIRA, on this,
>         of which
>                    I forgot.
>                                I can do
>                                      so now, if you'd like.  Finally,
>         glassfish
>                    appears to
>                               be stuck,
>                                      it's running, but not responding to
>         requests,
>                               probably due to
>                                      the SEVERE entry in the server.log file
>                    (see below).
> 
>                                      Faban\Logs:
>                                       agent.log = empty
>                                       cmdagent.log = No issues
>                                       faban.log.xml = No issues
> 
>                                      Master\Logs
>                                       catalina.out = No issues
>                                       localhost*log*txt = No issues
>                                       OlioDriver.3C\
>                                       log.xml : Two types of issues,
>                    doTagSearch and the
>                               catastrophic
>                                      "Forcefully terminating benchmark
>         run" :
>                    attached
>                                        GlassFish logs
>                                        jvm.log : Numerous dependency_failed
>                    entries :
>                               attached          server.log : Serveral SEVERE
>                    entires :
>                               Most notibly one where "a
>                                      signal was attempted before wait()"
>         : Attached
>                                      Any help resolving this would me much
>                    appreciated,
>                                      -Josh
> 
> 
> 
> 
> 
>                               --            -Josh
> 
> 
> 
> 
> 
>                       --    -Josh
> 
> 
> 
> 
>                       --    -Josh
> 
> 
> 
> 
>                    --            -Josh
> 
> 
> 
> 
> 
>            --    -Josh
> 
> 
> 
> 
>         -- 
>         -Josh
> 
> 
> 
> 
> 
> -- 
> -Josh
> 


Mime
View raw message