aurora-dev mailing list archives

From Bryan Helmkamp <br...@codeclimate.com>
Subject Re: Suitability of Aurora for one-time tasks
Date Thu, 27 Feb 2014 03:45:39 GMT
Got it. Thanks. Do finished Jobs and Tasks get garbage collected
automatically at some point?

Otherwise it seems like they will stack up pretty fast. (We might run
hundreds of thousands of jobs in a day.)

BTW, Aurora does not seem to like the resources =
'{{resources[{{resource_profile}}]}}' part. I tried to fix it, but
keep getting:

    InvalidConfigError: Expected dictionary argument, got
'{{resources[{{resource_profile}}]}}'

(For now I'm using a different .aurora file for each resource configuration.)
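
For reference, the dict lookup that the nested template is reaching for can be sketched in plain Python (a sketch only, not the Aurora DSL; the Resources/GB names and profile keys just mirror the example in the quoted reply below):

```python
from collections import namedtuple

# Plain-Python stand-ins for the DSL's Resources struct and GB unit.
Resources = namedtuple('Resources', ['cpu', 'ram', 'disk'])
GB = 1024 ** 3

# One named profile per job type, selected by key at enqueue time.
RESOURCE_PROFILES = {
    'graph_traversals': Resources(cpu=8.0, ram=32 * GB, disk=64 * GB),
    'compilations': Resources(cpu=16.0, ram=4 * GB, disk=64 * GB),
}

def resources_for(profile_name):
    """Return the Resources struct for a named profile, failing loudly."""
    try:
        return RESOURCE_PROFILES[profile_name]
    except KeyError:
        raise ValueError('unknown resource_profile: %r' % profile_name)

print(resources_for('graph_traversals').cpu)  # 8.0
```

The point is that the template must ultimately resolve to a dictionary/struct, not a string, which is what the InvalidConfigError above is complaining about.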

Best,

-Bryan

On Wed, Feb 26, 2014 at 9:04 PM, Kevin Sweeney <kevints@apache.org> wrote:
> And after a bit of code spelunking, it turns out the semantics you want
> already exist (just undocumented). I've updated the ticket to cover the
> documentation fix.
>
>
> On Wed, Feb 26, 2014 at 6:00 PM, Kevin Sweeney <kevints@apache.org> wrote:
>
>> The example I gave is somewhat syntactically invalid due to coding via
>> email, but that's more or less what the interface will look like. I also
>> filed https://issues.apache.org/jira/browse/AURORA-236 for more
>> first-class support of the semantics I think you want (though currently you
>> can fake it by setting max_failures to a very high number).
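
A sketch of that workaround in config form (illustrative only; whether `max_failures` belongs on the Process or the Task, and its exact semantics, should be checked against the configuration schema):

```python
# Sketch: fake run-until-success by allowing (effectively) unlimited
# failures, per the suggestion above. HIGH_MEM and work_on_one_item
# refer to the example config quoted later in this thread.
task = Task(
  processes = [work_on_one_item],
  resources = HIGH_MEM,
  max_failures = 100000,  # "very high number" to approximate retry-forever
)
```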
>>
>>
>> On Wed, Feb 26, 2014 at 5:33 PM, Bryan Helmkamp <bryan@codeclimate.com> wrote:
>>
>>> Thanks, Kevin. That pretty much looks like exactly what I need.
>>>
>>> -Bryan
>>>
>>> On Wed, Feb 26, 2014 at 8:16 PM, Kevin Sweeney <kevints@apache.org> wrote:
>>> > For a more dynamic approach to resource utilization you can use
>>> > something like this:
>>> >
>>> > # dynamic.aurora
>>> > # Enqueue each individual work-item with:
>>> > #   aurora create -E work_item=$work_item -E resource_profile=graph_traversals \
>>> > #     west/service-account-name/prod/process_$work_item
>>> > class Profile(Struct):
>>> >   queue_name = Required(String)
>>> >   resources = Required(Resources)
>>> >
>>> > HIGH_MEM = Resources(cpu = 8.0, ram = 32 * GB, disk = 64 * GB)
>>> > HIGH_CPU = Resources(cpu = 16.0, ram = 4 * GB, disk = 64 * GB)
>>> >
>>> > work_on_one_item = Process(name = 'work_on_one_item',
>>> >   cmdline = '''
>>> >     do_work "{{work_item}}"
>>> >   ''',
>>> > )
>>> >
>>> > task = Task(processes = [work_on_one_item],
>>> >   resources = '{{resources[{{resource_profile}}]}}')
>>> >
>>> > job = Job(
>>> >   task = task,
>>> >   cluster = 'west',
>>> >   role = 'service-account-name',
>>> >   environment = 'prod',
>>> >   name = 'process_{{work_item}}',
>>> > )
>>> >
>>> > resources = {
>>> >   'graph_traversals': HIGH_MEM,
>>> >   'compilations': HIGH_CPU,
>>> > }
>>> >
>>> > jobs = [job.bind(resources = resources)]
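
The per-item enqueue described in the comment at the top of that config could be driven from a small script. A hedged Python sketch that only assembles the `aurora create` command line for each work item (it prints rather than executes; the job-key parts and profile names follow the example):

```python
# Sketch: build one `aurora create` invocation per queued work item.
# The cluster/role/environment names come from the example config;
# swap in real values before using.
WORK_ITEMS = [
    ('repo_1', 'graph_traversals'),
    ('repo_2', 'compilations'),
]

def enqueue_command(work_item, resource_profile):
    """Return the argv for enqueueing a single work item."""
    job_key = 'west/service-account-name/prod/process_%s' % work_item
    return ['aurora', 'create',
            '-E', 'work_item=%s' % work_item,
            '-E', 'resource_profile=%s' % resource_profile,
            job_key, 'dynamic.aurora']

for item, profile in WORK_ITEMS:
    print(' '.join(enqueue_command(item, profile)))
```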
>>> >
>>> >
>>> >
>>> > On Wed, Feb 26, 2014 at 1:08 PM, Bryan Helmkamp <bryan@codeclimate.com> wrote:
>>> >
>>> >> Sure. Yes, they are shell commands and yes they are provided different
>>> >> configuration on each run.
>>> >>
>>> >> In effect we have a number of different job types that are queued up,
>>> >> and we need to run as quickly as possible. Each job type has different
>>> >> resource requirements. Every time we run the job, we provide different
>>> >> arguments (the "payload"). For example:
>>> >>
>>> >> $ ./do_something.sh SOME_ID (Requires 1 CPU and 1GB RAM)
>>> >> $ ./do_something_else.sh SOME_OTHER_ID (Requires 4 CPU and 4GB RAM)
>>> >> [... there are about 12 of these ...]
>>> >>
>>> >> -Bryan
>>> >>
>>> >> > On Wed, Feb 26, 2014 at 3:58 PM, Bill Farner <wfarner@apache.org> wrote:
>>> >> > Can you offer some more details on what the workload execution
>>> >> > looks like?  Are these shell commands?  An application that's
>>> >> > provided different configuration?
>>> >> >
>>> >> > -=Bill
>>> >> >
>>> >> >
>>> >> > On Wed, Feb 26, 2014 at 12:45 PM, Bryan Helmkamp <bryan@codeclimate.com> wrote:
>>> >> >
>>> >> >> Thanks, Kevin. The idea of always-on workers of varying sizes is
>>> >> >> effectively what we have right now in our non-Mesos world. The
>>> >> >> problem is that sometimes we end up with not enough workers for
>>> >> >> certain classes of jobs (e.g. High Memory), while part of the
>>> >> >> cluster sits idle.
>>> >> >>
>>> >> >> Conceptually, in my mind we would define approximately a dozen
>>> >> >> Tasks, one for each type of work we need to perform (with
>>> >> >> different resource requirements), and then run Jobs, each with a
>>> >> >> Task and a unique payload, but I don't think this model works
>>> >> >> with Mesos. It seems we'd need to create a unique Task for every
>>> >> >> Job.
>>> >> >>
>>> >> >> -Bryan
>>> >> >>
>>> >> >> On Wed, Feb 26, 2014 at 3:35 PM, Kevin Sweeney <kevints@apache.org> wrote:
>>> >> >> > A job is a group of nearly-identical tasks plus some
>>> >> >> > constraints like rack diversity. The scheduler considers each
>>> >> >> > task within a job equivalently schedulable, so you can't vary
>>> >> >> > things like resource footprint. It's perfectly fine to have
>>> >> >> > several jobs with just a single task, as long as each has a
>>> >> >> > different job key (which is (role, environment, name)).
>>> >> >> >
>>> >> >> > Another approach is to have a bunch of uniform always-on
>>> >> >> > workers (in different sizes). This can be expressed as a
>>> >> >> > Service like so:
>>> >> >> >
>>> >> >> > # workers.aurora
>>> >> >> > class Profile(Struct):
>>> >> >> >   queue_name = Required(String)
>>> >> >> >   resources = Required(Resources)
>>> >> >> >   instances = Required(Integer)
>>> >> >> >
>>> >> >> > HIGH_MEM = Resources(cpu = 8.0, ram = 32 * GB, disk = 64 * GB)
>>> >> >> > HIGH_CPU = Resources(cpu = 16.0, ram = 4 * GB, disk = 64 * GB)
>>> >> >> >
>>> >> >> > work_forever = Process(name = 'work_forever',
>>> >> >> >   cmdline = '''
>>> >> >> >     # TODO: Replace this with something that isn't pseudo-bash
>>> >> >> >     while true; do
>>> >> >> >       work_item=`take_from_work_queue {{profile.queue_name}}`
>>> >> >> >       do_work "$work_item"
>>> >> >> >       tell_work_queue_finished "{{profile.queue_name}}" "$work_item"
>>> >> >> >     done
>>> >> >> >   ''')
>>> >> >> >
>>> >> >> > task = Task(processes = [work_forever],
>>> >> >> >   resources = '{{profile.resources}}', # Note this is static per queue-name.
>>> >> >> > )
>>> >> >> >
>>> >> >> > service = Service(
>>> >> >> >   task = task,
>>> >> >> >   cluster = 'west',
>>> >> >> >   role = 'service-account-name',
>>> >> >> >   environment = 'prod',
>>> >> >> >   name = '{{profile.queue_name}}_processor',
>>> >> >> >   instances = '{{profile.instances}}', # Scale here.
>>> >> >> > )
>>> >> >> >
>>> >> >> > jobs = [
>>> >> >> >   service.bind(profile = Profile(
>>> >> >> >     resources = HIGH_MEM,
>>> >> >> >     queue_name = 'graph_traversals',
>>> >> >> >     instances = 50,
>>> >> >> >   )),
>>> >> >> >   service.bind(profile = Profile(
>>> >> >> >     resources = HIGH_CPU,
>>> >> >> >     queue_name = 'compilations',
>>> >> >> >     instances = 200,
>>> >> >> >   )),
>>> >> >> > ]
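
The pseudo-bash `work_forever` loop above has the shape of an ordinary queue consumer. A Python sketch of that shape, with the hypothetical queue tools (`take_from_work_queue`, `tell_work_queue_finished`) stood in for by a plain in-process `queue.Queue`, so this illustrates the work-stealing loop rather than being a drop-in replacement:

```python
import queue

def work_forever(work_queue, do_work, stop_when_empty=True):
    """Pull items from a queue and process them, mirroring the shell loop:
    take an item, do the work, then acknowledge completion."""
    while True:
        try:
            work_item = work_queue.get(block=not stop_when_empty)
        except queue.Empty:
            return  # a real always-on worker would block here instead
        do_work(work_item)
        work_queue.task_done()  # the tell_work_queue_finished equivalent

# Tiny demonstration with an in-process queue.
q = queue.Queue()
for item in ['graph_1', 'graph_2']:
    q.put(item)

done = []
work_forever(q, done.append)
print(done)  # ['graph_1', 'graph_2']
```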
>>> >> >> >
>>> >> >> >
>>> >> >> > On Wed, Feb 26, 2014 at 11:46 AM, Bryan Helmkamp <bryan@codeclimate.com> wrote:
>>> >> >> >
>>> >> >> >> Thanks, Bill.
>>> >> >> >>
>>> >> >> >> Am I correct in understanding that it is not possible to
>>> >> >> >> parameterize individual Jobs, just Tasks? Therefore, since I
>>> >> >> >> don't know the job definitions up front, I will have
>>> >> >> >> parameterized Task templates, and generate a new Task every
>>> >> >> >> time I need to run a Job?
>>> >> >> >>
>>> >> >> >> Is that the recommended route?
>>> >> >> >>
>>> >> >> >> Our work is very non-uniform, so I don't think work-stealing
>>> >> >> >> would be efficient for us.
>>> >> >> >>
>>> >> >> >> -Bryan
>>> >> >> >>
>>> >> >> >> On Wed, Feb 26, 2014 at 12:49 PM, Bill Farner <wfarner@apache.org> wrote:
>>> >> >> >> > Thanks for checking out Aurora!
>>> >> >> >> >
>>> >> >> >> > My short answer is that Aurora should handle thousands of
>>> >> >> >> > short-lived tasks/jobs per day without trouble.  (If you
>>> >> >> >> > proceed with this approach and encounter performance issues,
>>> >> >> >> > feel free to file tickets!)  The DSL does have some
>>> >> >> >> > mechanisms for parameterization.  In your case, since you
>>> >> >> >> > probably don't know all the job definitions upfront, you'll
>>> >> >> >> > probably want to parameterize with environment variables.  I
>>> >> >> >> > don't see this described in our docs, but there's a little
>>> >> >> >> > detail at the option declaration [1].
>>> >> >> >> >
>>> >> >> >> > Another approach worth considering is work-stealing, using a
>>> >> >> >> > single job as your pool of workers.  I would find this
>>> >> >> >> > easier to manage, but it would only be suitable if your work
>>> >> >> >> > items are sufficiently uniform.
>>> >> >> >> >
>>> >> >> >> > Feel free to continue the discussion!  We're also pretty
>>> >> >> >> > active in our IRC channel if you'd prefer that medium.
>>> >> >> >> >
>>> >> >> >> > [1] https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/client/options.py#L170-L183
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > -=Bill
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > On Tue, Feb 25, 2014 at 10:11 PM, Bryan Helmkamp <bryan@codeclimate.com> wrote:
>>> >> >> >> >
>>> >> >> >> >> Hello,
>>> >> >> >> >>
>>> >> >> >> >> I am considering Aurora for a key component of our
>>> >> >> >> >> infrastructure. Awesome work being done here.
>>> >> >> >> >>
>>> >> >> >> >> My question is: How suitable is Aurora for running
>>> >> >> >> >> short-lived tasks?
>>> >> >> >> >>
>>> >> >> >> >> Background: We (Code Climate) do static analysis of tens of
>>> >> >> >> >> thousands of repositories every day. We run a variety of
>>> >> >> >> >> forms of analysis, with heterogeneous resource
>>> >> >> >> >> requirements, and thus our interest in Mesos.
>>> >> >> >> >>
>>> >> >> >> >> Looking at Aurora, a lot of the core features look very
>>> >> >> >> >> helpful to us. Where I am getting hung up is figuring out
>>> >> >> >> >> how to model short-lived tasks as tasks/jobs. Long-running
>>> >> >> >> >> resource allocations are not really an option for us due to
>>> >> >> >> >> the variation in our workloads.
>>> >> >> >> >>
>>> >> >> >> >> My first thought was to create a Task for each type of
>>> >> >> >> >> analysis we run, and then start a new Job with the
>>> >> >> >> >> appropriate Task every time we want to run analysis
>>> >> >> >> >> (regulated by a queue). This doesn't seem to work though. I
>>> >> >> >> >> can't `aurora create` the same `.aurora` file multiple
>>> >> >> >> >> times with different Job names (as far as I can tell). Also
>>> >> >> >> >> there is the problem of how to customize each Job slightly
>>> >> >> >> >> (e.g. a payload).
>>> >> >> >> >>
>>> >> >> >> >> An obvious alternative is to create a unique Task every
>>> >> >> >> >> time we want to run work. This would result in tens of
>>> >> >> >> >> thousands of tasks being created every day, and from what I
>>> >> >> >> >> can tell Aurora does not intend to be used like that.
>>> >> >> >> >> (Please correct me if I am wrong.)
>>> >> >> >> >>
>>> >> >> >> >> Basically, I would like to hook my job queue up to Aurora
>>> >> >> >> >> to perform the actual work. There are a dozen different
>>> >> >> >> >> types of jobs, each with different performance
>>> >> >> >> >> requirements. Every time a job runs, it has a unique
>>> >> >> >> >> payload containing the definition of the work that should
>>> >> >> >> >> be performed.
>>> >> >> >> >>
>>> >> >> >> >> Can Aurora be used this way? If so, what is the proper way
>>> >> >> >> >> to model this with respect to Jobs and Tasks?
>>> >> >> >> >>
>>> >> >> >> >> Any/all help is appreciated.
>>> >> >> >> >>
>>> >> >> >> >> Thanks!
>>> >> >> >> >>
>>> >> >> >> >> -Bryan
>>> >> >> >> >>
>>> >> >> >> >> --
>>> >> >> >> >> Bryan Helmkamp, Founder, Code Climate
>>> >> >> >> >> bryan@codeclimate.com / 646-379-1810 / @brynary
>>> >> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> --
>>> >> >> >> Bryan Helmkamp, Founder, Code Climate
>>> >> >> >> bryan@codeclimate.com / 646-379-1810 / @brynary
>>> >> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Bryan Helmkamp, Founder, Code Climate
>>> >> >> bryan@codeclimate.com / 646-379-1810 / @brynary
>>> >> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Bryan Helmkamp, Founder, Code Climate
>>> >> bryan@codeclimate.com / 646-379-1810 / @brynary
>>> >>
>>>
>>>
>>>
>>> --
>>> Bryan Helmkamp, Founder, Code Climate
>>> bryan@codeclimate.com / 646-379-1810 / @brynary
>>>
>>
>>



-- 
Bryan Helmkamp, Founder, Code Climate
bryan@codeclimate.com / 646-379-1810 / @brynary
