airavata-dev mailing list archives

From Raminder Singh <raminderjsi...@gmail.com>
Subject Re: [jira] [Commented] (AIRAVATA-756) Ensure Airavata can renew proxy for long running jobs.
Date Tue, 05 Feb 2013 17:04:15 GMT
The problem arises when the walltime of the job is set longer than the proxy lifetime. Even
before submitting the job, we know that it will not be able to finish using the proxy provided.
The Gram listener needs to renew the proxy, and for long-running jobs it needs to be renewed
multiple times. Proxy lifetime is a server-level property and can't be configured per job. Like
Pedro, I have seen the Gram listener raise an error during proxy renewal because the job
finished during that time. As a workaround, I have had to set a higher value for all
jobs. The credential store may be able to solve this problem, but the current solution is to
increase the value.
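
To make the walltime check concrete, here is a minimal sketch (not Airavata code) that compares a job's walltime with the proxy lifetime before submission and estimates how many renewals the listener would need. The class and method names are hypothetical. The numbers come from this thread: maxwalltime = "1440" from the quoted RSL, which I am assuming is in minutes, and myproxy.life = 3600 seconds. The renewal count is a lower bound that assumes each renewal happens exactly at expiry.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; it is not part of Airavata or the Globus APIs.
final class ProxyRenewalPlan {
    private ProxyRenewalPlan() {}

    /** Returns true when a single proxy lasts at least as long as the requested walltime. */
    public static boolean proxyCoversWalltime(Duration walltime, Duration proxyLifetime) {
        return proxyLifetime.compareTo(walltime) >= 0;
    }

    /**
     * Lower bound on the number of renewals needed to cover the walltime:
     * ceil(walltime / lifetime) - 1, and 0 when the first proxy is already enough.
     */
    public static long renewalsNeeded(Duration walltime, Duration proxyLifetime) {
        long w = walltime.getSeconds();
        long p = proxyLifetime.getSeconds();
        if (p <= 0) {
            throw new IllegalArgumentException("proxy lifetime must be at least one second");
        }
        if (w <= p) {
            return 0;
        }
        return (w + p - 1) / p - 1; // integer ceiling division, minus the initial proxy
    }

    public static void main(String[] args) {
        Duration walltime = Duration.ofMinutes(1440);  // maxwalltime from the quoted RSL (assumed minutes)
        Duration shortProxy = Duration.ofSeconds(3600);   // myproxy.life=3600
        Duration longProxy = Duration.ofSeconds(172800);  // myproxy.life=172800 (the workaround)

        System.out.println(proxyCoversWalltime(walltime, shortProxy)); // false
        System.out.println(renewalsNeeded(walltime, shortProxy));      // 23
        System.out.println(proxyCoversWalltime(walltime, longProxy));  // true
        System.out.println(renewalsNeeded(walltime, longProxy));       // 0
    }
}
```

With the default lifetime, the job in this report would need at least 23 successful renewals. Each renewal is a point where the listener can fail, for example if the job finishes in that window.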

Thanks
Raminder

On Feb 5, 2013, at 11:49 AM, Pedro da Silveira wrote:

> Hi Suresh,
> 
> I agree with you that the right approach would be to update the proxy every 3600
> seconds, instead of creating a proxy with a very long lifetime.
> I think the fact that Airavata-Server is not reading the right value of
> myproxy.life is not the main problem, since Airavata-Server apparently is
> updating the proxy every 3600 seconds, as I checked in the Airavata-Server
> log.
> I believe the real problem occurred when the job had finished its execution
> cycle (after 15 hours): an error message appeared as if GRAM had tried
> to read the output of the job but couldn't establish a connection to
> Lonestar anymore, maybe because the proxy was outdated. The full error
> message is presented at the beginning of this JIRA thread. This error
> message was the last information printed in the Airavata-Server log.
> Before that message, the log showed that the job was active and that
> the proxy was renewed after 3600 seconds.
> 
> I will try to provide more information if I create more JIRA threads
> in the future.
> 
> 
> On Tue, Feb 5, 2013 at 9:45 AM, Suresh Marru (JIRA) <jira@apache.org> wrote:
> 
>> 
>>    [
>> https://issues.apache.org/jira/browse/AIRAVATA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571395#comment-13571395]
>> 
>> Suresh Marru commented on AIRAVATA-756:
>> ---------------------------------------
>> 
>> The bug of not honoring the properties file should be fixed. But I would
>> argue against increasing the proxy lifetime; that defeats the purpose of
>> short-lived proxy certificates and the philosophy behind GSI. In the short
>> term, this probably needs a workaround, but a better long-term fix is to
>> handle proxy delegation and renewals for long-running jobs.
>> 
>>> Ensure Airavata can renew proxy for long running jobs.
>>> ------------------------------------------------------
>>> 
>>>                Key: AIRAVATA-756
>>>                URL: https://issues.apache.org/jira/browse/AIRAVATA-756
>>>            Project: Airavata
>>>         Issue Type: Bug
>>>         Components: Distribution, GFac, XBaya
>>>   Affects Versions: 0.6
>>>        Environment: Mac OS 10.5.8
>>> Processor: 2 x 2.8GHz Quad-Core Intel Xeon
>>> Memory 8G 800 Mhz DDR2
>>> Java 1.6.0_26
>>>           Reporter: Pedro da Silveira
>>>           Priority: Minor
>>>            Fix For: 0.7
>>> 
>>> 
>>> After I fixed the problem with my local firewall, following the suggestion
>>> made by Raminderjeet Singh to leave ports 40,000 to 40,100 open to
>>> Airavata Server, I no longer receive the message "Status 0".
>>> However, I am still getting an error message on my Airavata-Server and
>>> a red alert in XBaya informing me that my job failed, when in reality
>>> my job ran successfully.
>>> This task took 17 hours to finish with 3 inputs in one application
>>> service.
>>> According to Raminderjeet Singh, if I change myproxy.life=3600 in
>>> airavata-server.properties to myproxy.life=172800, I won't get this error
>>> message anymore.
>>> ==========================
>>> Error message on Airavata-Server:
>>> ==========================
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] job
>> https://gridftp1.ls4.tacc.utexas.edu:50393/16289883153825569046/8943296923859945130/have
same status: ACTIVE
>>> [INFO] Job proxy expired. Trying to renew proxy
>>> org.globus.gsi.gssapi.GlobusGSSCredentialImpl@453931d9
>>> [INFO] Proxy file renewed to
>> /tmp/x509up_uogcebb9f81ba-8f59-4fec-b776-331c3f21bb62 for the user ogce
>> with 3600 lifetime.
>>> [ERROR] Context passed was NULL.
>>> java.lang.RuntimeException: Context passed was NULL.
>>>      at
>> org.apache.airavata.workflow.tracking.impl.ProvenanceNotifierImpl.sendingFault(ProvenanceNotifierImpl.java:496)
>>>      at
>> org.apache.airavata.workflow.tracking.impl.ProvenanceNotifierImpl.sendingFault(ProvenanceNotifierImpl.java:485)
>>>      at
>> org.apache.airavata.core.gfac.notification.impl.WorkflowTrackingNotification.executionFail(WorkflowTrackingNotification.java:108)
>>>      at
>> org.apache.airavata.core.gfac.notification.impl.DefaultNotifier.executionFail(DefaultNotifier.java:135)
>>>      at
>> org.apache.airavata.core.gfac.exception.JobSubmissionFault.sendFaultNotification(JobSubmissionFault.java:52)
>>>      at
>> org.apache.airavata.core.gfac.provider.impl.GramProvider.executeApplication(GramProvider.java:231)
>>>      at
>> org.apache.airavata.core.gfac.provider.AbstractProvider.execute(AbstractProvider.java:69)
>>>      at
>> org.apache.airavata.core.gfac.services.impl.AbstractSimpleService.execute(AbstractSimpleService.java:118)
>>>      at
>> org.apache.airavata.core.gfac.GfacAPI.gridJobSubmit(GfacAPI.java:140)
>>>      at
>> org.apache.airavata.xbaya.invoker.EmbeddedGFacInvoker.invoke(EmbeddedGFacInvoker.java:256)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpreter.handleWSComponent(WorkflowInterpreter.java:749)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpreter.executeDynamically(WorkflowInterpreter.java:533)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpreter.scheduleDynamically(WorkflowInterpreter.java:218)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton.executeWorkflow(WorkflowInterpretorSkeleton.java:389)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton.access$400(WorkflowInterpretorSkeleton.java:87)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton$2.run(WorkflowInterpretorSkeleton.java:382)
>>>      at java.lang.Thread.run(Thread.java:680)
>>> [INFO]        -----DATA-----
>>> [INFO]                lonestar4.tacc.teragrid.org,&( queue = "normal"
>> )( stdout =
>> "/scratch/01437/ogce/Vlab/Phonon/__p3_14/AppPhononSingle_Wed_Jan_30_20_00_56_CST_2013_78f5e160-e1df-4008-b02b-53edfa6edbd3/lonestar_application.stdout"
>> )( count = "72" )( executable =
>> "/scratch/01437/ogce/Vlab/Phonon/executePhonon.sh" )( stderr =
>> "/scratch/01437/ogce/Vlab/Phonon/__p3_14/AppPhononSingle_Wed_Jan_30_20_00_56_CST_2013_78f5e160-e1df-4008-b02b-53edfa6edbd3/lonestar_application.stderr"
>> )( maxwalltime = "1440" )( hostCount = "6" )( minmemory = "10240" )(
>> project = "TG-STA110014S" )( jobtype = "mpi" )( environment = ( "inputData"
>> "/scratch/01437/ogce/Vlab/Phonon/__p3_14/AppPhononSingle_Wed_Jan_30_20_00_56_CST_2013_78f5e160-e1df-4008-b02b-53edfa6edbd3/inputData"
>> ) ( "outputData"
>> "/scratch/01437/ogce/Vlab/Phonon/__p3_14/AppPhononSingle_Wed_Jan_30_20_00_56_CST_2013_78f5e160-e1df-4008-b02b-53edfa6edbd3/outputData"
>> ) )( proxy_timeout = "1" )( arguments =
>> "///scratch/01437/ogce/Vlab/Phonon/__p3_14/AppPhononSingle_Wed_Jan_30_20_00_56_CST_2013_78f5e160-e1df-4008-b02b-53edfa6edbd3/inputData/Pwscf_Input"
>> "///scratch/01437/ogce/Vlab/Phonon/__p3_14/AppPhononSingle_Wed_Jan_30_20_00_56_CST_2013_78f5e160-e1df-4008-b02b-53edfa6edbd3/inputData/Cd_PON_sp_LDA.vdb"
>> "///scratch/01437/ogce/Vlab/Phonon/__p3_14/AppPhononSingle_Wed_Jan_30_20_00_56_CST_2013_78f5e160-e1df-4008-b02b-53edfa6edbd3/inputData/Te_PON_LDA.vdb"
>> "///scratch/01437/ogce/Vlab/Phonon/__p3_14/AppPhononSingle_Wed_Jan_30_20_00_56_CST_2013_78f5e160-e1df-4008-b02b-53edfa6edbd3/inputData/Phonon_Input"
>> )( directory =
>> "/scratch/01437/ogce/Vlab/Phonon/__p3_14/AppPhononSingle_Wed_Jan_30_20_00_56_CST_2013_78f5e160-e1df-4008-b02b-53edfa6edbd3"
>> )( maxmemory = "15360" )
>>> [INFO]        -----END DATA-----
>>> [ERROR] The connection to the server failed (check host and port)
>> [Caused by: Connection refused]
>>> org.apache.airavata.core.gfac.exception.JobSubmissionFault: The
>> connection to the server failed (check host and port) [Caused by:
>> Connection refused]
>>>      at
>> org.apache.airavata.core.gfac.provider.impl.GramProvider.executeApplication(GramProvider.java:229)
>>>      at
>> org.apache.airavata.core.gfac.provider.AbstractProvider.execute(AbstractProvider.java:69)
>>>      at
>> org.apache.airavata.core.gfac.services.impl.AbstractSimpleService.execute(AbstractSimpleService.java:118)
>>>      at
>> org.apache.airavata.core.gfac.GfacAPI.gridJobSubmit(GfacAPI.java:140)
>>>      at
>> org.apache.airavata.xbaya.invoker.EmbeddedGFacInvoker.invoke(EmbeddedGFacInvoker.java:256)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpreter.handleWSComponent(WorkflowInterpreter.java:749)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpreter.executeDynamically(WorkflowInterpreter.java:533)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpreter.scheduleDynamically(WorkflowInterpreter.java:218)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton.executeWorkflow(WorkflowInterpretorSkeleton.java:389)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton.access$400(WorkflowInterpretorSkeleton.java:87)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton$2.run(WorkflowInterpretorSkeleton.java:382)
>>>      at java.lang.Thread.run(Thread.java:680)
>>> Caused by: org.globus.gram.GramException: The connection to the server
>> failed (check host and port) [Caused by: Connection refused]
>>>      at org.globus.gram.Gram.renew(Gram.java:595)
>>>      at org.globus.gram.GramJob.renew(GramJob.java:329)
>>>      at org.globus.gram.GramJob.renew(GramJob.java:315)
>>>      at
>> org.apache.airavata.core.gfac.provider.utils.JobSubmissionListener.waitFor(JobSubmissionListener.java:72)
>>>      at
>> org.apache.airavata.core.gfac.provider.impl.GramProvider.executeApplication(GramProvider.java:206)
>>>      ... 11 more
>>> Exception in thread "Thread-98"
>> org.apache.airavata.workflow.model.exceptions.WorkflowRuntimeException:
>> org.apache.airavata.workflow.model.exceptions.WorkflowException: The
>> connection to the server failed (check host and port) [Caused by:
>> Connection refused]
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton.executeWorkflow(WorkflowInterpretorSkeleton.java:392)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton.access$400(WorkflowInterpretorSkeleton.java:87)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton$2.run(WorkflowInterpretorSkeleton.java:382)
>>>      at java.lang.Thread.run(Thread.java:680)
>>> Caused by:
>> org.apache.airavata.workflow.model.exceptions.WorkflowException: The
>> connection to the server failed (check host and port) [Caused by:
>> Connection refused]
>>>      at
>> org.apache.airavata.xbaya.invoker.EmbeddedGFacInvoker.invoke(EmbeddedGFacInvoker.java:321)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpreter.handleWSComponent(WorkflowInterpreter.java:749)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpreter.executeDynamically(WorkflowInterpreter.java:533)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpreter.scheduleDynamically(WorkflowInterpreter.java:218)
>>>      at
>> org.apache.airavata.xbaya.interpretor.WorkflowInterpretorSkeleton.executeWorkflow(WorkflowInterpretorSkeleton.java:389)
>>>      ... 3 more
>>> Caused by: org.apache.airavata.core.gfac.exception.JobSubmissionFault:
>> The connection to the server failed (check host and port) [Caused by:
>> Connection refused]
>>>      at
>> org.apache.airavata.core.gfac.provider.impl.GramProvider.executeApplication(GramProvider.java:229)
>>>      at
>> org.apache.airavata.core.gfac.provider.AbstractProvider.execute(AbstractProvider.java:69)
>>>      at
>> org.apache.airavata.core.gfac.services.impl.AbstractSimpleService.execute(AbstractSimpleService.java:118)
>>>      at
>> org.apache.airavata.core.gfac.GfacAPI.gridJobSubmit(GfacAPI.java:140)
>>>      at
>> org.apache.airavata.xbaya.invoker.EmbeddedGFacInvoker.invoke(EmbeddedGFacInvoker.java:256)
>>>      ... 7 more
>>> Caused by: org.globus.gram.GramException: The connection to the server
>> failed (check host and port) [Caused by: Connection refused]
>>>      at org.globus.gram.Gram.renew(Gram.java:595)
>>>      at org.globus.gram.GramJob.renew(GramJob.java:329)
>>>      at org.globus.gram.GramJob.renew(GramJob.java:315)
>>>      at
>> org.apache.airavata.core.gfac.provider.utils.JobSubmissionListener.waitFor(JobSubmissionListener.java:72)
>>>      at
>> org.apache.airavata.core.gfac.provider.impl.GramProvider.executeApplication(GramProvider.java:206)
>>>      ... 11 more
>> 
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 

