From: Maxim Khutornenko
Date: Mon, 15 Aug 2016 10:05:37 -0700
Subject: Re: Support instance-specific TaskConfig in CreateJob API
To: dev@aurora.apache.org

I would love to hear more about constraint use cases that don't work
across jobs to see if/how we can extend Aurora to support them.

As far as heterogeneous jobs go, that effort would require rethinking
quite a few assumptions around fundamental Aurora principles to ensure we
don't lock ourselves into a corner wrt future features by accepting an
"easy to do" change short-term. I am -1 on supporting anything specific
for adhoc jobs only. IMO, this has to be an all-or-nothing feature adding
support for heterogeneous jobs across the stack. If you guys feel strongly
about this idea, please craft a high-level design summary for the
community to explore and review.

On Sat, Aug 13, 2016 at 7:43 AM, Mauricio Garavaglia
<mauriciogaravaglia@gmail.com> wrote:

> Hi,
>
> We have been experimenting with the idea of having heterogeneous tasks
> in a job. Mainly to support different Docker container configurations
> (like volumes to let tasks have different storage, different labels for
> logging purposes, or IP addresses).
> The main reason for using this instead of separate jobs is that
> scheduling constraints don't work across jobs, and we may want to have
> rack anti-affinity for the different instances.
>
> You can check how it works on the README in the repo
> [https://github.com/medallia/aurora/tree/0.13.0-medallia]. Basically the
> job includes a list of parameters that are later interpolated in the
> task config during Mesos task creation, so this happens at a later time
> and the different values to apply to each instance are held in the
> config. We can start discussing if you think the design is sound or the
> feature could be helpful, and start working to move it upstream.
>
> We used StartJobUpdate to achieve the same purpose, but it required some
> gymnastics during deployment that we wanted to avoid. Regarding Min
> Cai's issue about short-lived tasks finishing before the update starts,
> we solved it by initially configuring all the tasks with a dummy NOP
> ("no operation") process that just sits there waiting to be updated.
>
> Mauricio
>
>
> On Fri, Aug 12, 2016 at 3:17 PM, Min Cai wrote:
>
> > Thanks Maxim. Please see my previous reply to David's comments for a
> > more detailed response.
> >
> > On Fri, Aug 12, 2016 at 9:24 AM, Maxim Khutornenko wrote:
> >
> > > I am cautious about merging createJob and startJobUpdate as we don't
> > > support updates of adhoc jobs. It's logically unclear what adhoc job
> > > update would mean as adhoc job instances are not intended to survive
> > > terminal state.
> >
> > +1. Our adhoc job instances could be short-lived and finish way before
> > StartJobUpdate calls are made to Aurora.
> >
> > > Even if we decided to do so I am afraid it would not help with the
> > > scenario of creating a new heterogeneous job as the updater only
> > > supports a single TaskConfig target.
> >
> > We will have to make N StartJobUpdate calls to update N distinct task
> > configs, so it will be expensive if N is large, e.g. > 10K.
> >
> > > Speaking broadly, Aurora is built around the idea of homogeneous
> > > jobs. It's possible to have different task configs to support
> > > canaries and update rolls but we treat that state as *temporary*
> > > until config reconciliation completes.
> >
> > Agreed that homogeneous jobs are an important design consideration for
> > *long-running* jobs like Services. However, most adhoc jobs are
> > heterogeneous by nature. For example, they might need to process
> > different input files and write to different output files. Or they
> > might take different parameters etc.
> > It would be nice to extend Aurora to support heterogeneous tasks so
> > that it can be used for broader use cases as a meta-scheduler.
> >
> > Thanks, - Min
> >
> > > On Fri, Aug 12, 2016 at 8:03 AM, David McLaughlin
> > > <dmclaughlin@apache.org> wrote:
> > >
> > > > Hi Min,
> > > >
> > > > I'd prefer to add support for ad-hoc jobs to startJobUpdate and
> > > > completely remove the notion of job create.
> > > >
> > > > "Also, even the StartJobUpdate API is not scalable to a job with
> > > > 10K ~ 100K task instances and each instance has different task
> > > > config since we will have to invoke StartJobUpdate for each
> > > > distinct task config."
> > > >
> > > > What is the use case for that? Aurora was designed to have those
> > > > as separate jobs.
> > > >
> > > > Thanks,
> > > > David
> > > >
> > > > On Thu, Aug 11, 2016 at 2:56 PM, Min Cai wrote:
> > > >
> > > > > Hey fellow Aurora team:
> > > > >
> > > > > We would like to propose a simple and backwards compatible
> > > > > feature in the CreateJob API so that we can support
> > > > > instance-specific TaskConfigs. The use case here is an Adhoc job
> > > > > which has different resource settings as well as different
> > > > > command line arguments for each task instance. Aurora today
> > > > > already supports heterogeneous tasks for the same job via the
> > > > > StartJobUpdate API, i.e. we can update the job instances to use
> > > > > different task configs. This works reasonably well for long
> > > > > running tasks like Services. However, it is not feasible for
> > > > > Adhoc jobs where each task will finish right away before we even
> > > > > have a chance to invoke StartJobUpdate.
> > > > > Also, even the StartJobUpdate API is not scalable to a job with
> > > > > 10K ~ 100K task instances where each instance has a different
> > > > > task config, since we will have to invoke StartJobUpdate for
> > > > > each distinct task config.
> > > > >
> > > > > The proposal we have is to add an optional field in
> > > > > JobConfiguration for instance specific task config. It will
> > > > > override the default task config for the given instance ID
> > > > > ranges if specified. Otherwise, everything will be backwards
> > > > > compatible with the current API. The implementation of this
> > > > > change also seems to be very simple. We only need to plumb
> > > > > instance specific task configs when we call
> > > > > statemanager.insertPendingTasks in the
> > > > > SchedulerThriftInterface.createJob function.
> > > > >
> > > > >  /**
> > > > >   * Description of an Aurora job. One task will be scheduled for each instance within the job.
> > > > >   */
> > > > > @@ -328,13 +343,17 @@ struct JobConfiguration {
> > > > >    4: string cronSchedule
> > > > >    /** Collision policy to use when handling overlapping cron runs. Default is KILL_EXISTING. */
> > > > >    5: CronCollisionPolicy cronCollisionPolicy
> > > > > -  /** Task configuration for this job. */
> > > > > +  /** Default task configuration for all instances of this job. */
> > > > >    6: TaskConfig taskConfig
> > > > >    /**
> > > > >     * The number of instances in the job. Generated instance IDs for tasks will be in the range
> > > > >     * [0, instances).
> > > > >     */
> > > > >    8: i32 instanceCount
> > > > > +  /**
> > > > > +   * The instance specific task configs that override the default task config for given
> > > > > +   * instanceId ranges.
> > > > > +   */
> > > > > +  10: optional set instanceTaskConfigs
> > > > >  }
> > > > >
> > > > > Please let us know your comments and suggestions.
> > > > >
> > > > > Thanks, - Min
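
For readers following the thread, the override semantics described in the
original proposal (a default task config plus optional overrides for
instance ID ranges) could be sketched roughly as below. This is only an
illustration in Python, not Aurora code. The names InstanceRangeOverride,
resolve_task_config and expand_job are hypothetical, task configs are
modeled as plain dicts, and the sketch assumes inclusive ranges, whole-config
replacement, and first-match-wins ordering, none of which the proposal
specifies.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: a task config is modeled as a plain dict for illustration.
TaskConfig = Dict[str, object]


@dataclass(frozen=True)
class InstanceRangeOverride:
    """Hypothetical override: instances first..last (inclusive) use `config`."""
    first: int
    last: int
    config: TaskConfig


def resolve_task_config(
    instance_id: int,
    instance_count: int,
    default_config: TaskConfig,
    overrides: List[InstanceRangeOverride],
) -> TaskConfig:
    """Return the config for one instance: first matching override, else default."""
    # Instance IDs are in the range [0, instanceCount), as in the thrift comment.
    if not 0 <= instance_id < instance_count:
        raise ValueError(f"instance {instance_id} outside [0, {instance_count})")
    for override in overrides:
        if override.first <= instance_id <= override.last:
            return override.config
    return default_config


def expand_job(
    instance_count: int,
    default_config: TaskConfig,
    overrides: List[InstanceRangeOverride],
) -> Dict[int, TaskConfig]:
    """Build the full instance ID -> config map in a single pass."""
    return {
        i: resolve_task_config(i, instance_count, default_config, overrides)
        for i in range(instance_count)
    }


if __name__ == "__main__":
    default = {"cmd": "process --input=part-default", "ram_mb": 128}
    heavy = {"cmd": "process --input=part-large", "ram_mb": 1024}
    job = expand_job(4, default, [InstanceRangeOverride(1, 2, heavy)])
    for instance_id, config in job.items():
        print(instance_id, config["ram_mb"])
```

With no overrides, every instance resolves to the default config, which
corresponds to the backwards-compatible behavior the proposal describes.
Resolving all instances in one pass is the point of the contrast with
issuing one StartJobUpdate call per distinct config.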