From: holenoter <holenoter@me.com>
To: "Tkatcheva, Irina N (388D)" <irina.n.tkatcheva@jpl.nasa.gov>
Cc: dev@oodt.apache.org
Subject: Re: RE: Question about xmlrpc
Date: Tue, 21 Feb 2012 23:11:39 +0000 (GMT)
Message-id: <848cf01d-2db4-c0ef-8b5d-a67ef4f20a6d@me.com>
In-reply-to: <63E24D720DCE7246ABAA722B52D23CDF015D2C517597@ALTPHYEMBEVSP10.RES.AD.JPL>
hey irina, 

try increasing the number of retries to something like 100 or 200 and see if you get the same problems; basically, with your setup it will only retry for 10 minutes... if you launch a bunch of jobs, especially with a lot of conditions, they are going to keep the workflow manager overloaded for a while... if this doesn't fix the problem, it seems like there may be a synchronization bug in wengine

-brian
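
(Concretely: the settings quoted in Irina's reply below are 20 retries at a 30-second interval, i.e. 20 x 30 s = 600 s, which is where the ten-minute figure comes from. A minimal sketch of the change being suggested here, reusing the same two properties shown in Irina's reply, might look like:

  <property name="connectionRetries" value="200"/>
  <property name="connectionRetryIntervalSecs" value="30"/>

which stretches the retry window to 200 x 30 s = 6000 s, roughly 100 minutes.)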

On Feb 21, 2012, at 02:26 PM, "Tkatcheva, Irina N (388D)" <irina.n.tkatcheva@jpl.nasa.gov> wrote:

Hi Brian,

We have

  <property name="connectionRetries" value="20"/>
  <property name="connectionRetryIntervalSecs" value="30"/>

Irina
________________________________________
From: holenoter [holenoter@me.com]
Sent: Tuesday, February 21, 2012 2:21 PM
To: dev@oodt.apache.org
Cc: Tkatcheva, Irina N (388D)
Subject: Re: Question about xmlrpc

hey irina,

how many retries do you have set for each task and how long is your interval between retries?

-brian

On Feb 21, 2012, at 09:56 AM, "Tkatcheva, Irina N (388D)" <irina.n.tkatcheva@jpl.nasa.gov> wrote:

Hi Brian and all,

I have noticed that the system does recover after the "System overload: Maximum number of concurrent requests (100) exceeded" message, but usually some jobs stay in the 'Waiting on resource (executing)' state and never proceed further. I have seen it every time after the overload messages. I usually run a test that runs a bunch of jobs overnight. If there are no overload messages, all jobs complete; if there are overload messages, usually in the morning some jobs are stuck in the 'Waiting on resource (executing)' state. So it looks to me like the system does not recover completely.

Irina

On Feb 17, 2012, at 9:17 AM, Brian Foster wrote:

Hey Chris,

ya I'm in favor of adding the property but let's make it use 100 by default if the property is not set, and I would even say let's add it to the properties file but comment it out or something... that's a really advanced flag which only needs to be changed to get rid of that logging message... CAS works fine even when that message is being thrown... I think it prints to stdout, otherwise I would have just turned off the logging for that back when I added the client retry handlers that fixed the issue... oh and this is another thing you're probably gonna want to port to trunk workflow :)

-Brian
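
(A minimal sketch of the flag proposed above; the property name here is hypothetical, since the thread never settles on one. java.lang.Integer.getInteger reads a system property and falls back to the supplied default, which gives exactly the "100 unless set" behavior:

  // Hypothetical property name, for illustration only.
  int maxConcurrentRequests = Integer.getInteger(
      "org.apache.oodt.cas.xmlrpc.max.concurrent.requests", 100);

The commented-out entry in the properties file would then just document the flag without changing the default.)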

"Mattmann, Chris A (388J)" <chris.a.mattmann@jpl.nasa.gov> wrote:

Thanks Brian, I was thinking this too, +1, which is why I cautioned against any number greater than 256 in terms of thread count in my reply email too, since the risk is either that (a) you have to increase the ulimit (which extends the boundaries from devops oriented updates to sysops on the sysadmin side); and (b) the JVM will likely start thrashing unless there is an inordinate amount of RAM, or swap space, etc.

I think the best solution here is to simply make it a configurable property and then encourage projects to use a sensible default that's not too large...

Cheers,
Chris

On Feb 16, 2012, at 12:52 AM, Brian Foster wrote:

You have to be careful with the number you set that to because you are basically telling XML-RPC that it is now allowed to create 2000 threads in the same JVM... not a good practice... I don't remember the exact number but the JVM will crash if it creates a certain number of threads because there is a limit to the number of threads one process can create, and I believe this is restricted at the operating system level... and I believe this number is less than 2000... The trunk filemgr and wengine already have built-in client retry handling support and are configurable via java properties (i.e. org.apache.oodt.cas.filemgr.system.xmlrpc.connection.retries and o.a.o.c.filemgr.system.connection.retry.interval.seconds, and there are similar ones for wengine)... The message you are seeing is the XML-RPC server logging that it is already using all 100 worker threads... you will see this message if you create 100+ jobs in the RM (e.g. Workflow Conditions and Tasks) and they all start talking to the workflow manager or file manager at the same time... the client retry handlers will catch this error and just wait and retry again... you shouldn't be losing any data... the only inconvenience I guess is that the message is cluttering the logs

-Brian

On Feb 15, 2012, at 10:42 PM, "Cheng, Cecilia S (388K)" <cecilia.s.cheng@jpl.nasa.gov> wrote:

Hi Chris,

Sure we can discuss this in dev@oodt.apache.org.

If you feel comfortable w/ the 2000 number, of course I can push the patch upstream into Apache OODT. But what kind of tests, if any, should we do before we deliver the patch? Our projects are concerned that if we arbitrarily set a number, we don't know what other problems it might cause.

Thanks,
Cecilia

On 2/15/12 10:07 PM, "Mattmann, Chris A (388J)" <chris.a.mattmann@jpl.nasa.gov> wrote:

Hi Cecilia,

This is really good news!

A couple of questions:

1. Do you think you would be willing to push your XML-RPC patches upstream into Apache OODT so others in the community could benefit? This would involve filing corresponding JIRA issue(s) [1], and then letting the dev@oodt.apache.org list know.

2. Can we move this conversation onto dev@oodt.apache.org? I think others could benefit from the answers below.

Thanks and let me know. If you'd like to discuss more, that's fine too, but I'd urge us to move this onto the public Apache OODT lists.

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/OODT

On Feb 15, 2012, at 2:31 PM, Cheng, Cecilia S (388K) wrote:

Hi Chris and Paul,

Just want to fill you in on where we are w/ the xmlrpc problem that we see on ACOS and PEATE and get your advice.

As you might recall, on both projects, and in all 3 components (FM, RM, and WEngine), we will periodically see the following message in the console:

java.lang.RuntimeException: System overload: Maximum number of concurrent requests (100) exceeded

when the system is very busy. Since upgrading to the newer version of xmlrpc seems to be quite involved, we thought that we would just download the source code, change the hardcoded number of 100 to something bigger, recompile the jar file, and use that in our system.
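
(To make the failure mode concrete: the message comes from the XML-RPC server's bounded worker pool. The sketch below is illustrative, with invented names rather than the real xmlrpc-2.x internals, but it shows the mechanism: once all workers are busy and the pool has hit its hardcoded cap, the server throws the RuntimeException quoted above, and raising that cap from 100 to 2000 is the recompile being described.

  // Illustrative bounded worker pool; names are invented.
  final class WorkerPool {
      private static final int MAX_THREADS = 100; // the hardcoded cap
      private int active = 0;

      synchronized void acquire() {
          if (active >= MAX_THREADS) {
              throw new RuntimeException(
                  "System overload: Maximum number of concurrent requests ("
                      + MAX_THREADS + ") exceeded");
          }
          active++;
      }

      synchronized void release() {
          active--;
      }
  }

)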

So I set the number to 2000 and had Lan, Michael and Irina try again. All 3 of them said that it solved their problems, but now that this works, we have other concerns:

[1] Will setting this number so high (2000 vs. 100) create other problems?
[2] How can we find out what is a "good" number to use?
[3] What are some ways I can monitor these concurrent requests as they run? netstat?

Would you please share your thoughts on this?

Thanks,
Cecilia


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
Phone: +1 (818) 354-8810
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++