oodt-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Question about xmlrpc
Date Wed, 22 Feb 2012 22:18:22 GMT
Hey Cam,

On Feb 22, 2012, at 8:02 AM, Cameron Goodale wrote:

> Irina et al,
> 
> I have run into this same issue of exceeding the 100 connections to the
> xml-rpc filemanager and workflow manager, and here are the steps I took to
> try and avoid ever hitting the limit.
> 
> 1.  Optimize the Lucene Index (if you are using the Lucene Catalog and have
> 100,000's of entries, this can help improve how quickly your requests are
> handled, which will free up used connections faster).
> 2.  I used lsof to detect how many active connections were made to the
> FileManager; if the number exceeded 85 (to be safe) I would have my
> submission code sleep for 10 seconds and try again.  Not the most
> performant fix, but I never lose a job submission.
> 
> Option 2 was written in Python, and if you want a copy of it just let me
> know and I will be happy to fwd it along.
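For anyone who wants the gist before the fwd arrives, here is a minimal sketch of
that kind of lsof-based throttle in Python. This is not Cameron's actual script;
the File Manager port, the 85-connection threshold, and the submit_job call are
assumptions for illustration:

    import subprocess
    import time

    FM_PORT = 9000         # assumed default File Manager XML-RPC port
    MAX_CONNECTIONS = 85   # back off well before the XML-RPC limit of 100
    SLEEP_SECONDS = 10

    def active_connections(port):
        """Count ESTABLISHED TCP connections on the given port using lsof."""
        out = subprocess.run(["lsof", "-n", "-i", "TCP:%d" % port],
                             capture_output=True, text=True).stdout
        return sum(1 for line in out.splitlines() if "ESTABLISHED" in line)

    def wait_for_capacity(port=FM_PORT):
        """Sleep until the connection count drops below the threshold."""
        while active_connections(port) >= MAX_CONNECTIONS:
            time.sleep(SLEEP_SECONDS)

    # usage: call wait_for_capacity() right before each job submission, e.g.
    #   wait_for_capacity()
    #   submit_job(...)   # hypothetical submission call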

How about a JIRA issue in Apache OODT ville and a devops script, 
committed to .... trunk/workflow/src/main/bin?

:)

Cheers,
Chris

> 
> On Tue, Feb 21, 2012 at 2:21 PM, holenoter <holenoter@me.com> wrote:
> 
>> 
>> hey irina,
>> 
>> how many retries do you have set for each task and how long is your
>> interval between retries?
>> 
>> -brian
>> 
>> On Feb 21, 2012, at 09:56 AM, "Tkatcheva, Irina N (388D)" <
>> irina.n.tkatcheva@jpl.nasa.gov> wrote:
>> 
>> Hi Brian and all,
>> 
>> I have noticed that the system does recover after the "System overload:
>> Maximum number of concurrent requests (100) exceeded" message, but usually
>> some jobs stay in the 'Waiting on resource (executing)' state and never
>> proceed further. I have seen it every time after the overload messages. I
>> usually run a test that runs a bunch of jobs overnight. If there are no
>> overload messages, all jobs are completed; if there are overload messages,
>> usually in the morning some jobs are stuck in the 'Waiting on resource
>> (executing)' state. So it looks to me like the system does not recover
>> completely.
>> 
>> Irina
>> 
>> 
>> 
>> On Feb 17, 2012, at 9:17 AM, Brian Foster wrote:
>> 
>> Hey Chris,
>> 
>> ya I'm in favor of adding the property, but let's make it use 100 by
>> default if the property is not set, and I would even say let's add it to the
>> properties file but comment it out or something.. that's a really advanced
>> flag which only needs to be changed to get rid of that logging message...
>> CAS works fine even when that message is being thrown... I think it prints
>> to stdout, otherwise I would have just turned the logging for that off back
>> when I added the client retry handlers that fixed the issue... oh and this
>> is another thing you're probably gonna want to port to trunk workflow :)
>> 
>> -Brian
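For concreteness, the commented-out entry Brian describes could look something
like this in filemgr.properties. The property name below is hypothetical; no
such property exists yet, it is only a sketch of the proposal:

    # advanced: cap on concurrent XML-RPC worker threads (hypothetical property
    # name, for illustration only; the server defaults to 100 if this is unset)
    #org.apache.oodt.cas.filemgr.system.xmlrpc.max.connections=100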
>> 
>> "Mattmann, Chris A (388J)" <chris.a.mattmann@jpl.nasa.gov> wrote:
>> 
>> Thanks Brian, I was thinking this too, +1, which is why I cautioned
>> against any number greater than 256
>> in terms of thread count in my reply email too, since the risk is that
>> (a) you have to increase the
>> ulimit (which extends the boundaries from devops-oriented updates to
>> sysops on the sysadmin side);
>> and/or (b) the JVM will likely start thrashing unless there is an inordinate
>> amount of RAM, or swap space, etc.
>> 
>> I think the best solution here is to simply make it a configurable
>> property and then encourage projects
>> to use a sensible default that's not too large...
>> 
>> Cheers,
>> Chris
>> 
>> On Feb 16, 2012, at 12:52 AM, Brian Foster wrote:
>> 
>> You have to be careful with the number you set that to, because you are
>> basically telling XML-RPC that it is now allowed to create 2000 threads in
>> the same JVM... not a good practice... I don't remember the exact number,
>> but the JVM will crash if it creates a certain number of threads because
>> there is a limit to the number of threads one process can create, and I
>> believe this is restricted at the operating system level... and I believe
>> this number is less than 2000... The trunk filemgr and wengine already have
>> built-in client retry handling support and are configurable via Java
>> properties (i.e.
>> org.apache.oodt.cas.filemgr.system.xmlrpc.connection.retries and
>> o.a.o.c.filemgr.system.connection.retry.interval.seconds, and there are
>> similar ones for wengine)... The message you are seeing is the XML-RPC
>> server logging that it is already using all 100 worker threads... you will
>> see this message if you create 100+ jobs in the RM (e.g. Workflow Conditions
>> and Tasks) and they all start talking to the workflow manager or file
>> manager at the same time... the client retry handlers will catch this error
>> and just wait and retry again... you shouldn't be losing any data... the
>> only inconvenience I guess is that the message is cluttering the logs
>> 
>> -Brian
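To make Brian's pointer concrete, the client retry handling can be tuned with
entries like the following. The first name is as given above; the second is
expanded here from Brian's abbreviation, so verify the exact spelling against
your filemgr.properties, and the values are only examples:

    # number of times the filemgr client retries a failed XML-RPC connection
    org.apache.oodt.cas.filemgr.system.xmlrpc.connection.retries=3
    # seconds to wait between retries (name expanded from the abbreviation
    # above; double-check it in your deployment)
    org.apache.oodt.cas.filemgr.system.xmlrpc.connection.retry.interval.seconds=10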
>> 
>> On Feb 15, 2012, at 10:42 PM, "Cheng, Cecilia S (388K)" <
>> cecilia.s.cheng@jpl.nasa.gov> wrote:
>> 
>> 
>> Hi Chris,
>> 
>> Sure we can discuss this in dev@oodt.apache.org.
>> 
>> If you feel comfortable w/ the 2000 number, of course I can push the patch
>> upstream into Apache OODT. But what kind of tests, if any, should we do
>> before we deliver the patch? Our projects are concerned that if we
>> arbitrarily set a number, we don't know what other problems it might cause.
>> 
>> Thanks,
>> Cecilia
>> 
>> On 2/15/12 10:07 PM, "Mattmann, Chris A (388J)"
>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>> 
>> Hi Cecilia,
>> 
>> This is really good news!
>> 
>> A couple questions:
>> 
>> 1. Do you think you would be willing to push your XML-RPC patches upstream
>> into Apache OODT so others in the
>> community could benefit? This would involve filing corresponding JIRA
>> issue(s)
>> [1], and then letting dev@oodt.apache.org
>> know.
>> 
>> 2. Can we move this conversation onto dev@oodt.apache.org? I think others
>> could benefit from the answers below.
>> 
>> Thanks and let me know. If you'd like to discuss more, that's fine too, but
>> I'd urge us to move this onto the public Apache OODT
>> lists.
>> 
>> Cheers,
>> Chris
>> 
>> [1] http://issues.apache.org/jira/browse/OODT
>> 
>> On Feb 15, 2012, at 2:31 PM, Cheng, Cecilia S (388K) wrote:
>> 
>> Hi Chris and Paul,
>> 
>> Just want to fill you in on where we are w/ the xmlrpc problem that we see
>> on
>> ACOS and PEATE and get your advice.
>> 
>> As you might recall, on both projects, and in all 3 components (FM, RM, and
>> WEngine), we will periodically see the following message in the console:
>> 
>> java.lang.RuntimeException: System overload: Maximum number of concurrent
>> requests (100) exceeded
>> 
>> when the system is very busy. Since upgrading to the newer version of
>> xmlrpc seems to be quite involved, we thought that we would just download
>> the source code and change the hardcoded number of 100 to something bigger,
>> recompile the jar file, and use that in our system.
>> 
>> So I set the number to 2000 and had Lan, Michael, and Irina try again.
>> All 3 of them said that it solved their problems, but now that this works,
>> we have other concerns:
>> 
>> [1] Will setting this number so high (2000 vs. 100) create other problems?
>> [2] How can we find out what is a “good” number to use?
>> [3] What are some ways I can monitor these concurrent requests as they run?
>> netstat?
>> 
>> Would you please share your thoughts on this?
>> 
>> Thanks,
>> Cecilia
>> 
>> 
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattmann@nasa.gov
>> WWW: http://sunset.usc.edu/~mattmann/
>> Phone: +1 (818) 354-8810
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
>> 
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattmann@nasa.gov
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
>> 
> 
> 
> -- 
> 
> Sent from a Tin Can attached to a String


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

