oodt-dev mailing list archives

From holenoter <holeno...@me.com>
Subject Re: RE: Question about xmlrpc
Date Tue, 21 Feb 2012 23:11:39 GMT
hey irina, 

try increasing the number of retries to something like 100 or 200 and see if you still get the
same problems... with your current setup a task will only retry for 10 minutes (20 retries x 30
seconds)... if you launch a bunch of jobs, especially with a lot of conditions, they are going to
keep the workflow manager overloaded for a while... if this doesn't fix the problem then there may
be a synchronization bug in wengine
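
to bump the retries, the task configuration would look something like this (just a sketch based on the values below, tune the count for your setup):

<property name="connectionRetries" value="200"/>
<property name="connectionRetryIntervalSecs" value="30"/>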

-brian

On Feb 21, 2012, at 02:26 PM, "Tkatcheva, Irina N (388D)" <irina.n.tkatcheva@jpl.nasa.gov>
wrote:

> Hi Brian,
>
> We have
> <property name="connectionRetries" value="20"/>
> <property name="connectionRetryIntervalSecs" value="30"/>
>
> Irina
> ________________________________________
> From: holenoter [holenoter@me.com]
> Sent: Tuesday, February 21, 2012 2:21 PM
> To: dev@oodt.apache.org
> Cc: Tkatcheva, Irina N (388D)
> Subject: Re: Question about xmlrpc
>
> hey irina,
>
> how many retries do you have set for each task and how long is your interval between retries?
>
> -brian
>
> On Feb 21, 2012, at 09:56 AM, "Tkatcheva, Irina N (388D)" <irina.n.tkatcheva@jpl.nasa.gov> wrote:
>
> Hi Brian and all,
>
> I have noticed that the system does recover after the "System overload: Maximum number
> of concurrent requests (100) exceeded" message, but usually some jobs stay in the 'Waiting on
> resource (executing)' state and never proceed further. I have seen it every time after
> the overload messages. I usually run a test that runs a bunch of jobs overnight. If there
> are no overload messages, all jobs are completed; if there are overload messages, usually in
> the morning some jobs are stuck in the 'Waiting on resource (executing)' state. So it looks to
> me like the system does not recover completely.
>
> Irina
>
>
>
> On Feb 17, 2012, at 9:17 AM, Brian Foster wrote:
>
> Hey Chris,
>
> ya I'm in favor of adding the property but let's make it use 100 by default if the property
> is not set, and I would even say let's add it to the properties file but comment it out or
> something... that's a really advanced flag which only needs to be changed to get rid of that
> logging message... CAS works fine even when that message is being thrown... I think it prints
> to stdout, otherwise I would have just turned the logging for that off back when I added the
> client retry handlers that fixed the issue... oh and this is another thing you're probably gonna
> want to port to trunk workflow :)
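>
> maybe something like this in the properties file (the property name here is just a placeholder, we'd still need to pick the real one):
>
> # max concurrent XML-RPC worker requests... uncomment to override the default of 100
> #org.apache.oodt.xmlrpc.max.concurrent.requests=100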
>
> -Brian
>
> "Mattmann, Chris A (388J)" <chris.a.mattmann@jpl.nasa.gov<mailto:chris.a.mattmann@jpl.nasa.gov><mailto:chris.a.mattmann@jpl.nasa.gov<mailto:chris.a.mattmann@jpl.nasa.gov>>>
wrote:
>
> Thanks Brian, I was thinking this too, +1, which is why I cautioned against any number
> greater than 256 in terms of thread count in my reply email too, since the risk is that either
> (a) you have to increase the ulimit (which extends the boundaries from devops-oriented updates
> to sysops on the sysadmin side); or (b) the JVM will likely start thrashing unless there is an
> inordinate amount of RAM, or swap space, etc.
>
> I think the best solution here is to simply make it a configurable property and then
encourage projects
> to use a sensible default that's not too large...
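>
> Something along these lines in the server code would do it (the property name below is only
> illustrative, not a decided-on name):
>
> // Hypothetical sketch: read the limit from a system property, falling back to the
> // current hardcoded value of 100 when nothing is configured.
> public class XmlRpcLimits {
>     public static final int MAX_CONCURRENT_REQUESTS =
>         Integer.getInteger("org.apache.oodt.xmlrpc.max.concurrent.requests", 100);
> }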
>
> Cheers,
> Chris
>
> On Feb 16, 2012, at 12:52 AM, Brian Foster wrote:
>
> You have to be careful with the number you set that to because you are basically telling
> XML-RPC that it is now allowed to create 2000 threads in the same JVM... not a good practice...
> I don't remember the exact number but the JVM will crash if it creates a certain number of
> threads because there is a limit to the number of threads one process can create and I believe
> this is restricted at the operating system level... and I believe this number is less than
> 2000... The trunk filemgr and wengine already have built-in client retry handling support
> and are configurable via Java properties (i.e. org.apache.oodt.cas.filemgr.system.xmlrpc.connection.retries
> and o.a.o.c.filemgr.system.connection.retry.interval.seconds, and there are similar ones for
> wengine)... The message you are seeing is the XML-RPC server logging that it is already using 100
> worker threads... you will see this message if you create 100+ jobs in the RM (e.g. Workflow
> Conditions and Tasks) and they all start talking to the workflow manager or file manager at
> the same time... the client retry handlers will catch this error and just wait and retry again...
> you shouldn't be losing any data... the only inconvenience I guess is that the message is cluttering
> the logs
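>
> for example, bumping the filemgr client retries is just a matter of setting something like
>
> org.apache.oodt.cas.filemgr.system.xmlrpc.connection.retries=200
>
> (that one is spelled out above... the interval property I only abbreviated, so double-check the
> exact name in the filemgr properties file)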
>
> -Brian
>
> On Feb 15, 2012, at 10:42 PM, "Cheng, Cecilia S (388K)" <cecilia.s.cheng@jpl.nasa.gov> wrote:
>
>
> Hi Chris,
>
> Sure we can discuss this in dev@oodt.apache.org.
>
> If you feel comfortable w/ the 2000 number, of course I can push the patch
> upstream into Apache OODT. But what kind of tests, if any, should we do
> before we deliver the patch? Our projects are concerned that if we
> arbitrarily set a number, we don't know what other problems it might cause.
>
> Thanks,
> Cecilia
>
> On 2/15/12 10:07 PM, "Mattmann, Chris A (388J)" <chris.a.mattmann@jpl.nasa.gov> wrote:
>
> Hi Cecilia,
>
> This is really good news!
>
> A couple questions:
>
> 1. Do you think you would be willing to push your XML-RPC patches upstream
> into Apache OODT so others in the
> community could benefit? This would involve filing corresponding JIRA issue(s)
> [1], and then letting the dev@oodt.apache.org list know.
>
> 2. Can we move this conversation onto dev@oodt.apache.org? I think others
> could benefit from the answers below.
>
> Thanks and let me know. If you'd like to discuss more, that's fine too, but
> I'd urge us to move this onto the public Apache OODT
> lists.
>
> Cheers,
> Chris
>
> [1] http://issues.apache.org/jira/browse/OODT
>
> On Feb 15, 2012, at 2:31 PM, Cheng, Cecilia S (388K) wrote:
>
> Hi Chris and Paul,
>
> Just want to fill you in on where we are w/ the xmlrpc problem that we see on
> ACOS and PEATE and get your advice.
>
> As you might recall, on both projects, and in all 3 components (FM, RM, and
> WEngine), we will periodically see the following message in the console:
>
> java.lang.RuntimeException: System overload: Maximum number of concurrent
> requests (100) exceeded
>
> when the system is very busy. Since upgrading to the newer version of xmlrpc
> seems to be quite involved, we thought that we would just download the source
> code, change the hardcoded number of 100 to something bigger, recompile
> the jar file, and use that in our system.
>
> So I set the number to 2000 and had Lan, Michael and Irina try again. All 3
> of them said that it solved their problems, but now that this works, we have
> other concerns:
>
> [1] Will setting this number so high (2000 vs. 100) create other problems?
> [2] How can we find out what is a “good” number to use?
> [3] What are some ways I can monitor these concurrent requests as they run?
> netstat?
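> (e.g. I was thinking of something rough like counting established connections on the
> manager's port:
>
> netstat -an | grep 9000 | grep -c ESTABLISHED
>
> with 9000 just as an example port, but I am not sure that is the right way to look at it.)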
>
> Would you please share your thoughts on this?
>
> Thanks,
> Cecilia
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> Phone: +1 (818) 354-8810
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
