aurora-dev mailing list archives

From Mark Chu-Carroll <mchucarr...@apache.org>
Subject Re: Proposal: API changes to getTasksStatus
Date Wed, 28 May 2014 11:42:07 GMT
Great. +1 from me.


On Tue, May 27, 2014 at 7:39 PM, David McLaughlin <david@dmclaughlin.com> wrote:

> Pagination would be a no-op to the client because it would be opt-in only,
> so it would continue to fetch all the tasks in one request.
>
> But you raise a good point in that presumably the client is also going to
> be blocked for several seconds while executing getTasksStatus for large
> jobs. Making the response more lightweight could be a big win there, but I
> would need a better understanding of how the client is using those
> responses first.
>
>
> On Tue, May 27, 2014 at 3:34 PM, Mark Chu-Carroll <mchucarroll@apache.org> wrote:
>
> > Interestingly, when we first expanded getTasksStatus, I didn't like the
> > idea, because I thought it would have exactly this problem! It's a *lot* of
> > information to get in a single burst.
> >
> > Have you checked what effect it'll have on the command-line client? In
> > general, the command-line has the context do a single API call, gathers the
> > results, and returns to a command implementation. It'll definitely
> > complicate things to add pagination.  How much of an effect will it be?
> >
> >    -Mark
> >
> >
> >
> > On Tue, May 27, 2014 at 5:32 PM, David McLaughlin <david@dmclaughlin.com> wrote:
> >
> > > As outlined in AURORA-458, using the new jobs page with a large (but
> > > reasonable) number of active and complete tasks can take a long time[1]
> > > to render. Performance profiling done as part of AURORA-471 shows that
> > > the main factor in response time is the size of the uncompressed payload
> > > that has to be rendered and returned to the client.
> > >
> > > To that end, I think we have two approaches:
> > >
> > >  1) Add pagination to the getTasksStatus call.
> > >  2) Make the getTasksStatus response more lightweight.
> > >
> > >
> > > Pagination
> > > ---------------
> > >
> > > Pagination would be the simplest approach, and would scale to
> > > arbitrarily large numbers of tasks moving forward. The main issue with
> > > this is that we need all active tasks to build the configuration summary
> > > at the top of the job page.
> > >
> > > As a workaround we could add a new API call - getTaskConfigSummary -
> > > which returns something like:
> > >
> > >
> > > struct ConfigGroup {
> > >   1: TaskConfig config
> > >   2: set<i32> instanceIds
> > > }
> > >
> > > struct ConfigSummary {
> > >   1: JobKey jobKey
> > >   2: set<ConfigGroup> groups
> > > }
> > >
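A minimal sketch of how a server might build the ConfigSummary groups described above, written in Python for illustration. The `TaskConfig` fields and the `summarize_configs` name are hypothetical stand-ins (the real TaskConfig has many more fields); only the idea of grouping instance ids by identical config comes from the proposal.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class TaskConfig:
    # Hypothetical stand-in: only enough fields to tell two configs apart.
    job_name: str
    num_cpus: float
    ram_mb: int


@dataclass(frozen=True)
class ConfigGroup:
    config: TaskConfig
    instance_ids: FrozenSet[int]


def summarize_configs(active: List[Tuple[int, TaskConfig]]) -> Set[ConfigGroup]:
    """Group the instance ids of active tasks by identical config."""
    by_config: Dict[TaskConfig, Set[int]] = defaultdict(set)
    for instance_id, config in active:
        by_config[config].add(instance_id)
    return {ConfigGroup(c, frozenset(ids)) for c, ids in by_config.items()}
```

With a summary like this available from its own call, the job page would no longer need every active task in hand just to render the configuration header.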
> > >
> > > To support pagination without breaking the existing API, we could add
> > > offset and limit fields to the TaskQuery struct.
> > >
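One way the opt-in offset and limit fields could behave, sketched in Python. This assumes both fields are optional and that the server returns tasks in a stable order; `apply_pagination` is a hypothetical helper, and the real TaskQuery has other fields that are omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskQuery:
    # Only the two proposed fields are modelled. Both default to None so that
    # existing callers, which never set them, keep their current behaviour.
    offset: Optional[int] = None
    limit: Optional[int] = None


def apply_pagination(tasks: List[str], query: TaskQuery) -> List[str]:
    """Return the slice of `tasks` selected by `query`.

    Assumes `tasks` is already in a stable order; without one, pages fetched
    in separate requests could overlap or skip entries.
    """
    start = query.offset or 0
    if query.limit is None:
        return tasks[start:]
    return tasks[start:start + query.limit]
```

A caller that leaves both fields unset still receives every task in a single response, so existing clients would not need to change.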
> > >
> > > Make getTasksStatus more lightweight
> > > ------------------------------------
> > >
> > > getTasksStatus currently returns a list of ScheduledTask instances. The
> > > biggest (in terms of payload size) child object of a ScheduledTask is
> > > the TaskConfig struct, which itself contains an ExecutorConfig.
> > >
> > > I took a sample response from one of our internal production instances
> > > and it turns out that around 65% of the total response size was for
> > > ExecutorConfig objects, and specifically the "cmdline" property of
> > > these. We currently do not use this information anywhere in the UI nor
> > > do we inspect it when grouping taskConfigs, and it would be a relatively
> > > easy win to just drop these from the response.
> > >
> > > We'd still need this information for the config grouping, so we could
> > > add the response suggested for getTaskConfigSummary as another property
> > > and allow the client to reconcile these objects if it needs to:
> > >
> > >
> > > struct TaskStatusResponse {
> > >   1: list<LightweightTask> tasks
> > >   2: set<ConfigGroup> configSummary
> > > }
> > >
> > >
> > > This would significantly reduce the uncompressed payload size while
> > > still containing the same data.
> > >
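A rough Python sketch of the client-side reconciliation this would allow. The tuple shapes used for a lightweight task and a config group are hypothetical simplifications, and the sketch assumes each instance id appears in at most one group.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical shapes: a lightweight task is (instance_id, status) and a
# config group is (config_label, instance_ids).
LightweightTask = Tuple[int, str]
ConfigGroup = Tuple[str, List[int]]


def reconcile(
    tasks: List[LightweightTask], config_summary: List[ConfigGroup]
) -> List[Tuple[int, str, Optional[str]]]:
    """Attach each task's config by looking its instance id up in the summary."""
    config_by_instance: Dict[int, str] = {}
    for config, instance_ids in config_summary:
        for instance_id in instance_ids:
            config_by_instance[instance_id] = config
    # Tasks whose instance id is not covered by any group get None.
    return [
        (instance_id, status, config_by_instance.get(instance_id))
        for instance_id, status in tasks
    ]
```

Each config is then carried once per group instead of once per task, which is where the payload saving would come from.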
> > > However, there is still a potentially significant part of the payload
> > > size remaining: task events (and these *are* currently used in the UI).
> > > We could solve this by dropping task events from the LightweightTask
> > > struct too, and fetching them lazily in batches.
> > >
> > > i.e. an API call like:
> > >
> > >
> > > getTaskEvents(1: JobKey key, 2: set<i32> instanceIds)
> > >
> > >
> > > Could return:
> > >
> > >
> > > struct TaskEventResult {
> > >   1: i32 instanceId
> > >   2: list<TaskEvent> taskEvents
> > > }
> > >
> > > struct TaskEventResponse {
> > >   1: JobKey key
> > >   2: list<TaskEventResult> results
> > > }
> > >
> > >
> > > Events would then be fetched and rendered only as the user clicks
> > > through the pages of tasks.
> > >
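The lazy, batched fetching could look roughly like this on the client, sketched in Python. `LazyEventCache` and its `fetch` callback are hypothetical; the callback stands in for a getTaskEvents call that takes a set of instance ids and returns the events for each.

```python
from typing import Callable, Dict, List, Set


class LazyEventCache:
    """Fetch task events one page at a time, requesting each instance once."""

    def __init__(self, fetch: Callable[[Set[int]], Dict[int, List[str]]]):
        # `fetch` is an assumed wrapper around a getTaskEvents-style call.
        self._fetch = fetch
        self._cache: Dict[int, List[str]] = {}

    def events_for_page(self, instance_ids: List[int]) -> Dict[int, List[str]]:
        # Ask the server only for instances not seen on an earlier page.
        missing = {i for i in instance_ids if i not in self._cache}
        if missing:
            self._cache.update(self._fetch(missing))
        return {i: self._cache.get(i, []) for i in instance_ids}
```

Revisiting a page costs no further request in this sketch, and each new page triggers a single batched call for the instances it shows.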
> > >
> > > Proposal
> > > -------------
> > >
> > > I think pagination makes more sense here. It adds moderate complexity
> > > to the UI (this is purely due to our use of smart-table, which is not
> > > very friendly to server-side pagination) but the client logic would
> > > actually be simpler with the new getTaskConfigSummary API call.
> > >
> > > I do think there is value in considering whether the ScheduledTask
> > > struct needs to contain all of the information it does - but this could
> > > be done as part of a separate or complementary performance improvement
> > > ticket.
> > >
> > >
> > >
> > >
> > > [1] - At Twitter we observed 2000 active + 100 finished tasks having a
> > > payload size of 10MB which took 8~10 seconds to complete.
> > >
> >
>
