airavata-dev mailing list archives

From Raminderjeet Singh <raminderjsi...@gmail.com>
Subject Re: Job Submission Limit
Date Mon, 03 Aug 2015 19:02:17 GMT
Job verification is already done in the provider code to make sure the job
actually got to the queue, so you need not worry about it. I like the idea
of adding to a queue on failure. In this approach the job inputs are already
moved to a job folder and the PBS script is already created. If we want to
move the job to some other resource, you need to decide between an existing
task or a new one. When submitting to the same resource, you need to use the
recovery mechanism, as the job working folder already exists.

You may still need some throttling trick with the RabbitMQ queue: if users
keep submitting more jobs, jobs that went to the recovery queue will not
recover because of server load. We may need to think about a priority queue
in RabbitMQ for such jobs and handle them differently. Draw a diagram with
all these details so that we don't miss anything.
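The priority handling suggested above can be modeled with a small in-process sketch: recovery jobs carry a higher priority than fresh submissions, so they drain first even while users keep submitting new work. In RabbitMQ itself this maps to declaring the queue with an `x-max-priority` argument and publishing recovery messages with a higher priority value; the `SubmissionQueue` class and priority constants below are illustrative, not Airavata code.

```python
import heapq
import itertools

PRIORITY_FRESH, PRIORITY_RECOVERY = 1, 10

class SubmissionQueue:
    """In-process stand-in for a RabbitMQ priority queue (illustrative)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def publish(self, job_id, priority=PRIORITY_FRESH):
        # heapq is a min-heap, so negate the priority to pop highest first
        heapq.heappush(self._heap, (-priority, next(self._seq), job_id))

    def consume(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = SubmissionQueue()
q.publish("fresh-1")
q.publish("recovered-7", priority=PRIORITY_RECOVERY)  # came from the recovery queue
q.publish("fresh-2")

# The recovery job jumps ahead of the fresh submissions
drained = [q.consume() for _ in range(3)]
```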

Thanks
Raminder

On Mon, Aug 3, 2015 at 2:42 PM, K Yoshimoto <kenneth@sdsc.edu> wrote:

>
> Every 5 minutes sounds like a reasonable starting point to me.
> Maybe with a configurable limit on the number of retries before stopping
> and alerting the operator.
>
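The resubmission policy sketched above (retry on a regular tick up to a configurable cap, then alert the operator) could look like this. The callables `submit` and `alert_operator` and the 5-minute interval are assumptions for illustration, not Airavata code.

```python
POLL_INTERVAL_SECONDS = 300  # "every 5 minutes"
MAX_RETRIES = 5              # configurable limit before alerting

def resubmit_with_cap(job_id, submit, alert_operator, max_retries=MAX_RETRIES):
    """Try a rejected job up to max_retries times; alert on exhaustion.

    `submit` returns True when the resource accepts the job. In a real
    deployment each attempt would be spaced POLL_INTERVAL_SECONDS apart
    by whatever scheduler ticks the recovery queue.
    """
    for attempt in range(1, max_retries + 1):
        if submit(job_id):
            return attempt  # number of attempts it took
    alert_operator(job_id)
    return None

# Simulate a resource that accepts the job on the third attempt
attempts = {"n": 0}
def fake_submit(job_id):
    attempts["n"] += 1
    return attempts["n"] >= 3

alerts = []
took = resubmit_with_cap("job-42", fake_submit, alerts.append)
```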
> The rejection messages may vary, so what I would do is check
> the remote resource queue to see if the job is there.  If so,
> you know it succeeded; if not, handle the possible failure.
>
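Checking the remote resource queue, as suggested above, amounts to listing the remote scheduler's jobs and looking for the submitted job id. The qstat output below follows a typical PBS layout, but formats vary by scheduler and version, so treat the parsing as an assumption to adapt.

```python
SAMPLE_QSTAT = """\
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
123456.master     gaussian_run     cgateway          00:01:02 R normal
123457.master     namd_job         cgateway                 0 Q normal
"""

def queued_job_ids(qstat_output):
    """Extract job ids from qstat-style output, skipping header rows."""
    ids = set()
    for line in qstat_output.splitlines():
        fields = line.split()
        # data rows start with '<number>.<host>'; headers and rules do not
        if fields and fields[0][0].isdigit():
            ids.add(fields[0].split(".")[0])
    return ids

def submission_succeeded(job_id, qstat_output):
    # If the job shows up in the remote queue, the submit worked;
    # otherwise fall through to the failure-handling path.
    return job_id in queued_job_ids(qstat_output)
```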
> On Mon, Aug 03, 2015 at 01:51:59PM -0400, John Weachock wrote:
> > I still have some questions about this method.
> >
> > When we reach the policy limit and move rejected jobs into our queue, how
> > will we determine when it's safe to attempt submission again? A regular
> > ticking event, such as every 5 minutes? Or is there another way?
> >
> > What types of rejection messages/codes will we receive? For example, what
> > happens if a job is rejected because it requests too many resources,
> > rather than exceeding the number of jobs?
> > On Aug 3, 2015 1:40 PM, "K Yoshimoto" <kenneth@sdsc.edu> wrote:
> >
> > >
> > > Yes, that's the idea.  In general, something dynamic and adaptable
> > > will probably be more robust than a rigid limit.
> > >
> > > On Mon, Aug 03, 2015 at 01:15:50PM -0400, John Weachock wrote:
> > > > Ah! I think I understand what you're saying now. Rather than trying
> > > > to ensure we stay within the policy limits, we should just submit a
> > > > job and check if it was accepted or not. If it was rejected, we can
> > > > add it to a queue to be resubmitted at a later time or to a different
> > > > resource. Is this correct?
> > > >
> > > > On Mon, Aug 3, 2015 at 1:10 PM, K Yoshimoto <kenneth@sdsc.edu> wrote:
> > > >
> > > > >
> > > > >  The point is that the policy limit could change at any time.
> > > > > If it does, and there is a mismatch in the limit at the resource
> > > > > and the limit in Airavata, bad things will happen.  Schedulers
> > > > > will vary in the format of their policy limit output, so it's
> > > > > more reliable to monitor actual job submissions and handle
> > > > > failures.
> > > > > Remember that it's possible for job limits to vary for a single
> > > > > resource not only on queue name, but on job characteristics,
> > > > > such as allocation account, core count, wall clock limit, etc.
> > > > >
> > > > > On Mon, Aug 03, 2015 at 12:53:22PM -0400, Raminderjeet Singh wrote:
> > > > > > These limits are usually set as a policy by the resource provider
> > > > > > and rarely change. As long as we have a placeholder to
> > > > > > configure/change it in Airavata for a user/gateway, we don't need
> > > > > > to get it from a resource.
> > > > > >
> > > > > >
> > > > > > On Mon, Aug 3, 2015 at 12:33 PM, John Weachock <jweachock@gmail.com> wrote:
> > > > > >
> > > > > > > I think it would be best for us to not maintain our own record
> > > > > > > of the job limit - we need to remember that jobs will be
> > > > > > > submitted to these resources using the community accounts
> > > > > > > through other methods as well. I think I remember someone
> > > > > > > mentioning that it would be ideal to poll the resources for
> > > > > > > their limits - can anyone confirm that we can do this?
> > > > > > >
> > > > > > > On Mon, Aug 3, 2015 at 12:24 PM, Douglas Chau <dchau3@binghamton.edu> wrote:
> > > > > > >
> > > > > > >> Hmm @shameera, that's very true. Perhaps we can store the
> > > > > > >> submission requests in the registry. In the event that the
> > > > > > >> orchestrator goes down, we can recover them through the
> > > > > > >> registry afterwards.
> > > > > > >>
> > > > > > >> @Yoshimoto, I didn't think about that - will take it into
> > > > > > >> consideration. Thanks for the insight!
> > > > > > >>
> > > > > > >> On Mon, Aug 3, 2015 at 12:11 PM, K Yoshimoto <kenneth@sdsc.edu> wrote:
> > > > > > >>
> > > > > > >>>
> > > > > > >>> I think you also want to put in a check for successful
> > > > > > >>> submission, then take appropriate action on failed
> > > > > > >>> submission. It can be difficult to keep the submission
> > > > > > >>> limit up-to-date.
> > > > > > >>>
> > > > > > >>> On Mon, Aug 03, 2015 at 11:03:46AM -0400, Douglas Chau wrote:
> > > > > > >>> > Hey Devs,
> > > > > > >>> >
> > > > > > >>> > Just wanted to get some input on our plan to implement the
> > > > > > >>> > queue throttling feature.
> > > > > > >>> >
> > > > > > >>> > Batch Queue Throttling:
> > > > > > >>> > - in Orchestrator, the current submit() function in
> > > > > > >>> > GFACPassiveJobSubmitter publishes jobs to rabbitmq immediately
> > > > > > >>> > - instead of publishing immediately, we should pass the
> > > > > > >>> > messages to a new component, call it BatchQueueClass
> > > > > > >>> > - we need BatchQueueClass to periodically check to see when
> > > > > > >>> > we can unload jobs to submit
> > > > > > >>> >
> > > > > > >>> > Adding BatchQueueClass:
> > > > > > >>> > - set up a new table (or tables) to hold compute resource
> > > > > > >>> > names and their corresponding queues' current job counts and
> > > > > > >>> > maximum job limits
> > > > > > >>> > - data models in Airavata have information on the maximum job
> > > > > > >>> > submission limit for a queue, but no data on how many jobs
> > > > > > >>> > are currently running
> > > > > > >>> > - the current job number will effectively act as a counter,
> > > > > > >>> > incremented when a job is submitted and decremented when a
> > > > > > >>> > job is completed
> > > > > > >>> > - once that is done, BatchQueueClass needs to periodically
> > > > > > >>> > check the new table to see if the user's requested queue's
> > > > > > >>> > current job count is below the queue job limit. If it is,
> > > > > > >>> > we can pop jobs off and submit them until we hit the job
> > > > > > >>> > limit; if not, we wait until we're back under the limit.
> > > > > > >>> >
> > > > > > >>> > How does this sound?
> > > > > > >>> >
> > > > > > >>> > Doug
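The BatchQueueClass counter logic proposed in the thread can be sketched as follows: track per-queue running-job counts against the configured maximum, increment on submission, decrement on completion, and only release held jobs while the count is under the limit. The class and method names are illustrative; the real component would back the counts with the proposed registry table rather than an in-memory dict.

```python
from collections import defaultdict, deque

class BatchQueue:
    """In-memory sketch of the proposed BatchQueueClass throttling logic."""

    def __init__(self, job_limits):
        self.job_limits = job_limits        # queue name -> max jobs
        self.running = defaultdict(int)     # queue name -> current job count
        self.pending = defaultdict(deque)   # queue name -> held jobs

    def enqueue(self, queue_name, job_id):
        self.pending[queue_name].append(job_id)

    def job_completed(self, queue_name):
        self.running[queue_name] -= 1       # decrement the counter

    def release(self, queue_name):
        """Pop and 'submit' held jobs until the queue hits its limit."""
        released = []
        while (self.pending[queue_name]
               and self.running[queue_name] < self.job_limits[queue_name]):
            released.append(self.pending[queue_name].popleft())
            self.running[queue_name] += 1   # increment on submission
        return released

bq = BatchQueue({"normal": 2})
for job in ("j1", "j2", "j3"):
    bq.enqueue("normal", job)

first = bq.release("normal")    # only 2 slots available, j3 stays held
bq.job_completed("normal")      # one job finishes, freeing a slot
second = bq.release("normal")   # the held job can now go
```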
