oodt-dev mailing list archives

From Cameron Goodale <sigep...@gmail.com>
Subject Re: Question about xmlrpc
Date Thu, 23 Feb 2012 00:30:00 GMT
Chris,

I think the lsof guard could probably be added to the existing
wmgr-client shell script and activated via an option such as a maximum
connection count.  The Python script I have attached is closely tied to
running MODSCAG jobs for snow.  I will create a JIRA issue for the
enhancement to check for connections within the wmgr-client script.

Michael,

A few key areas in the script you might need to tweak are as follows:

Line 37 - Here you will want to swap in your own lsof invocation for mine;
we run the Workflow Manager on port 9001:

37   cmd = '/usr/sbin/lsof -i :9001 | grep ESTABLISHED | wc'

Line 48 - That is our implementation-specific wmgr-client command string;
it passes in our specific metadata key/values to kick off the modscagv2
event.

One last note:
I misspoke in my previous email since I was working from memory.  In this
code I check the number of connections to the Workflow Manager
(localhost:9001) after every 8 jobs are submitted, and when I find more
than 30 connections the script sleeps for 60 seconds.  Since we are also
using the Resource Manager, the number of connections to the Workflow
Manager can climb pretty quickly, so this was a pretty conservative setup.
Feel free to tweak it to your liking, and if you have any questions let me
know.

You could probably shorten lines 59 - 76 based on your use case (and,
looking back at the code, I could clean mine up as well).
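
In case the attachment gets stripped by the list, here is a minimal sketch
of the guard logic in Python (not the real script; the job command list is
a placeholder for your own wmgr-client invocations, and I use wc -l here
instead of parsing the full wc output):

import subprocess
import time

WMGR_PORT = 9001          # port the workflow manager listens on
MAX_CONNECTIONS = 30      # back off once we see more than this many
CHECK_EVERY = 8           # check the count after every 8 submissions
SLEEP_SECONDS = 60        # how long to sleep before re-checking

def established_connections(port):
    # Count ESTABLISHED sockets on the given port using lsof.
    cmd = "/usr/sbin/lsof -i :%d | grep ESTABLISHED | wc -l" % port
    return int(subprocess.check_output(cmd, shell=True).strip())

def submit_jobs(job_commands):
    # job_commands: shell command strings, e.g. wmgr-client event kickoffs.
    for i, cmd in enumerate(job_commands, start=1):
        subprocess.call(cmd, shell=True)
        if i % CHECK_EVERY == 0:
            while established_connections(WMGR_PORT) > MAX_CONNECTIONS:
                time.sleep(SLEEP_SECONDS)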

Hope this helps.


-Cameron


On Wed, Feb 22, 2012 at 2:18 PM, Mattmann, Chris A (388J) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hey Cam,
>
> On Feb 22, 2012, at 8:02 AM, Cameron Goodale wrote:
>
> > Irina et al,
> >
> > I have run into this same issue of exceeding the 100-connection limit on
> > the XML-RPC File Manager and Workflow Manager; here are the steps I took
> > to try to avoid ever hitting the limit.
> >
> > 1.  Optimize the Lucene index (if you are using the Lucene catalog and
> > have hundreds of thousands of entries, this can improve how quickly your
> > requests are handled, which frees up used connections faster).
> > 2.  I used lsof to detect how many active connections were open to the
> > File Manager; if the number exceeded 85 (to be safe), I would have my
> > submission code sleep for 10 seconds and try again.  Not the most
> > performant fix, but I never lose a job submission.
> >
> > Option 2 was written in Python; if you want a copy of it, just let me
> > know and I will be happy to forward it along.
>
> How about a JIRA issue in Apache OODT ville and a devops script,
> committed to .... trunk/workflow/src/main/bin?
>
> :)
>
> Cheers,
> Chris
>
> >
> > On Tue, Feb 21, 2012 at 2:21 PM, holenoter <holenoter@me.com> wrote:
> >
> >>
> >> hey irina,
> >>
> >> how many retries do you have set for each task, and how long is your
> >> interval between retries?
> >>
> >> -brian
> >>
> >> On Feb 21, 2012, at 09:56 AM, "Tkatcheva, Irina N (388D)" <
> >> irina.n.tkatcheva@jpl.nasa.gov> wrote:
> >>
> >> Hi Brian and all,
> >>
> >> I have noticed that the system does recover after the "System overload:
> >> Maximum number of concurrent requests (100) exceeded" message, but
> >> usually some jobs stay in the 'Waiting on resource (executing)' state
> >> and never proceed further. I have seen it every time after the overload
> >> messages. I usually run a test that runs a bunch of jobs overnight. If
> >> there are no overload messages, all jobs complete; if there are overload
> >> messages, usually in the morning some jobs are stuck in the 'Waiting on
> >> resource (executing)' state. So it looks to me like the system does not
> >> recover completely.
> >>
> >> Irina
> >>
> >>
> >>
> >> On Feb 17, 2012, at 9:17 AM, Brian Foster wrote:
> >>
> >> Hey Chris,
> >>
> >> ya I'm in favor of adding the property, but let's make it use 100 by
> >> default if the property is not set, and I would even say let's add it to
> >> the properties file but comment it out or something... that's a really
> >> advanced flag which only needs to be changed to get rid of that logging
> >> message... CAS works fine even when that message is being thrown... I
> >> think it prints to stdout, otherwise I would have just turned off the
> >> logging for that back when I added the client retry handlers that fixed
> >> the issue... oh, and this is another thing you're probably gonna want to
> >> port to trunk workflow :)
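> >>
> >> e.g. something like this in the properties file (the property name here
> >> is just a hypothetical, since the flag doesn't exist yet):
> >>
> >> # max concurrent XML-RPC worker threads; defaults to 100 if unset
> >> # org.apache.oodt.cas.xmlrpc.max.concurrent.requests=100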
> >>
> >> -Brian
> >>
> >> "Mattmann, Chris A (388J)" <chris.a.mattmann@jpl.nasa.gov<mailto:
> >> chris.a.mattmann@jpl.nasa.gov>> wrote:
> >>
> >> Thanks Brian, I was thinking this too, +1, which is why I cautioned
> >> against any number greater than 256 in terms of thread count in my reply
> >> email too, since the risk is either that (a) you have to increase the
> >> ulimit (which extends the boundaries from devops-oriented updates to
> >> sysops on the sysadmin side), or (b) the JVM will likely start thrashing
> >> unless there is an inordinate amount of RAM, or swap space, etc.
> >>
> >> I think the best solution here is to simply make it a configurable
> >> property and then encourage projects
> >> to use a sensible default that's not too large...
> >>
> >> Cheers,
> >> Chris
> >>
> >> On Feb 16, 2012, at 12:52 AM, Brian Foster wrote:
> >>
> >> You have to be careful with the number you set that to, because you are
> >> basically telling XML-RPC that it is now allowed to create 2000 threads
> >> in the same JVM... not a good practice... I don't remember the exact
> >> number, but the JVM will crash if it creates a certain number of
> >> threads, because there is a limit to the number of threads one process
> >> can create, and I believe this is restricted at the operating system
> >> level... and I believe this number is less than 2000... The trunk
> >> filemgr and wengine already have built-in client retry handling support,
> >> configurable via Java properties (i.e.
> >> org.apache.oodt.cas.filemgr.system.xmlrpc.connection.retries and
> >> o.a.o.c.filemgr.system.connection.retry.interval.seconds, and there are
> >> similar ones for wengine)... The message you are seeing is the XML-RPC
> >> server logging that it is already using 100 worker threads... you will
> >> see this message if you create 100+ jobs in the RM (e.g. Workflow
> >> Conditions and Tasks) and they all start talking to the workflow manager
> >> or file manager at the same time... the client retry handlers will catch
> >> this error and just wait and retry again... you shouldn't be losing any
> >> data... the only inconvenience, I guess, is that the message clutters
> >> the logs
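> >>
> >> for example, in filemgr.properties you might set something like the
> >> following (the values are just placeholders, and I'm expanding the
> >> abbreviated interval property name from memory, so double-check it
> >> against the source):
> >>
> >> org.apache.oodt.cas.filemgr.system.xmlrpc.connection.retries=3
> >> org.apache.oodt.cas.filemgr.system.xmlrpc.connection.retry.interval.seconds=5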
> >>
> >> -Brian
> >>
> >> On Feb 15, 2012, at 10:42 PM, "Cheng, Cecilia S (388K)"
> >> <cecilia.s.cheng@jpl.nasa.gov> wrote:
> >>
> >>
> >> Hi Chris,
> >>
> >> Sure, we can discuss this in dev@oodt.apache.org.
> >>
> >> If you feel comfortable w/ the 2000 number, of course I can push the
> >> patch upstream into Apache OODT. But what kind of tests, if any, should
> >> we do before we deliver the patch? Our projects are concerned that if we
> >> arbitrarily set a number, we don't know what other problems it might
> >> cause.
> >>
> >> Thanks,
> >> Cecilia
> >>
> >> On 2/15/12 10:07 PM, "Mattmann, Chris A (388J)"
> >> <chris.a.mattmann@jpl.nasa.gov> wrote:
> >>
> >> Hi Cecilia,
> >>
> >> This is really good news!
> >>
> >> A couple questions:
> >>
> >> 1. Do you think you would be willing to push your XML-RPC patches
> >> upstream into Apache OODT so others in the community could benefit? This
> >> would involve filing corresponding JIRA issue(s) [1], and then letting
> >> dev@oodt.apache.org know.
> >>
> >> 2. Can we move this conversation onto dev@oodt.apache.org? I think
> >> others could benefit from the answers below.
> >>
> >> Thanks, and let me know. If you'd like to discuss more, that's fine too,
> >> but I'd urge us to move this onto the public Apache OODT lists.
> >>
> >> Cheers,
> >> Chris
> >>
> >> [1] http://issues.apache.org/jira/browse/OODT
> >>
> >> On Feb 15, 2012, at 2:31 PM, Cheng, Cecilia S (388K) wrote:
> >>
> >> Hi Chris and Paul,
> >>
> >> Just want to fill you in on where we are w/ the xmlrpc problem that we
> >> see on ACOS and PEATE, and get your advice.
> >>
> >> As you might recall, on both projects, and in all 3 components (FM, RM,
> >> and WEngine), we will periodically see the following message in the
> >> console:
> >>
> >> java.lang.RuntimeException: System overload: Maximum number of
> >> concurrent requests (100) exceeded
> >>
> >> when the system is very busy. Since upgrading to the newer version of
> >> xmlrpc seems to be quite involved, we thought we would just download the
> >> source code, change the hardcoded number of 100 to something bigger,
> >> recompile the jar file, and use that in our system.
> >>
> >> So I set the number to 2000 and had Lan, Michael, and Irina try again.
> >> All 3 of them said that it solved their problems, but now that this
> >> works, we have other concerns:
> >>
> >> [1] Will setting this number so high (2000 vs. 100) create other
> >> problems?
> >> [2] How can we find out what is a “good” number to use?
> >> [3] What are some ways I can monitor these concurrent requests as they
> >> run? netstat?
> >>
> >> Would you please share your thought on this?
> >>
> >> Thanks,
> >> Cecilia
> >>
> >>
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: chris.a.mattmann@nasa.gov
> >> WWW: http://sunset.usc.edu/~mattmann/
> >> Phone: +1 (818) 354-8810
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
> >>
> >>
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Senior Computer Scientist
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 171-266B, Mailstop: 171-246
> >> Email: chris.a.mattmann@nasa.gov
> >> WWW: http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Assistant Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
> >>
> >>
> >
> >
> > --
> >
> > Sent from a Tin Can attached to a String
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>


-- 

Sent from a Tin Can attached to a String
