aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maxim Khutornenko <ma...@apache.org>
Subject Re: Proposal: API changes to getTasksStatus
Date Wed, 18 Jun 2014 20:21:28 GMT
Bumping up this thread as pagination does not solve client cases where all
tasks must be pulled for evaluation. Example: SLA commands in aurora_admin.

I would like to explore the #2 idea proposed by David and subsequently
outlined by Bill/Suman.

Specifically, I am proposing to add an optional format: ScheduledTaskFormat
parameter into the existing getTasksStatus() RPC:

// Defines a set of fields to be populated
// when returning ScheduledTasks. A format is
// a string composed of valid struct field names
// of an arbitrary depth separated by '.'
// Valid examples:
//    "taskEvents"
//    "assignedTask.instanceId"
//    "assignedTask.task.metadata"
// Invalid examples:
//    "task" - field does not exist.
//    "assignedTask.assignedPorts.health" -
//       collection value lookup is not supported.
struct ScheduledTaskFormat {
  1: set<string> fields
}

Given the above structure, the scheduler would be able to return
ScheduledTask instances populated according to the desired format with all
other fields set to NULL.

Alternatively, we could invert the format meaning from "include" to
"exclude" and use it as an exclusion filter. However, I am leaning towards
the "include" approach as a better one from perf standpoint.

There is also a possibility of defining a new RPC to support this filtering
but I am not convinced it's worth the effort.

Any comments/suggestions?

Thanks,
Maxim


On Fri, May 30, 2014 at 3:43 PM, David McLaughlin <david@dmclaughlin.com>
wrote:

> On Fri, May 30, 2014 at 11:22 AM, Suman Karumuri <mansu@apache.org> wrote:
>
> > Looks like my earlier mail has bounced. Sending it again.
> >
> > Can you please add some explanation to AURORA-471 explaining where the
> time
> > is being spent with some data and some micro benchmarks. As is it is
> > unclear why thrift.flush should be taking so long. Is the default
> > implementation of TServlet.doPost() doing something that can be
> > optimized?If thrift serialization is really quick, where are we spending
> > the time? Or is thrift.js spending too much time parsing 10 megs of JSON?
> >
>
> I updated AURORA-458 with the jvisualvm profile which show that most time
> is spent in GzipStream.write. Bill and I had profiled like this before, but
> hadn't seen GzipStream be such an obvious cause of latency. When I also
> tried to do some experiments with curl and removing compression, I didn't
> see any effect in latency so ruled out Gzip as a factor. Turns out this is
> because I was removing the Accept-Encoding header but not removing the
> --compressed flag (which effectively sets the same header). When removing
> both, it becomes a lot clearer of the overhead of compressing responses on
> the fly (see ticket for details).
>
> I still think the uncompressed 10MB is going to be too slow though,
> especially for clients with high latency to the scheduler host.
>
>
>
> > Pagination is a good idea. The only downside is that users would loose
> the
> > ability to sort and filter on the UI (this feature is to be implemented).
> > While pagination improves the time it takes to load a page, it doesn't
> > necessarily improve the time it takes for a user to achieve his goal.
> >
>
> Yeah, we would need to add sorting and filtering to the query.
>
>
> >
> > For this reason, I feel that we should drop the extra information from
> the
> > getTasksStatus first (if the client is unaffected by the change) to
> reduce
> > the page load time and see how much it improves the performance.  To
> > improve perceived performance, we can reduce the need to reload the job
> > page (by loading tasks in the back ground and client side caching using
> > $cache), since the only time the user will notice the delay is when
> loading
> > the page which is ok if it's few seconds after throwing out the executor
> > config.
> >
> >
> Hmm, I don't see how $cache gets us anything. The user will want the latest
> information every time they hit the page.
>
>
> > I think we should delay pagination as long as possible since operating on
> > the all the tasks in the UI with sorting, filtering and visualization
> adds
> > a lot of powerful capabilities. We can also add filtering and searching
> > capabilities in the client with a paginated API as well. But I feel
> > delaying pagination would result in a quicker win instead since the API
> can
> > be changed in a backwards compatible manner.
> >
> >
> Pagination will be optional and totally backwards compatible, so they can
> still get summaries of all tasks by not supplying an offset or limit to
> their query. If you drop information from the response however, now the
> client can't do that at all.
>
>
> >
> > On Wed, May 28, 2014 at 4:42 AM, Mark Chu-Carroll <
> mchucarroll@apache.org>
> > wrote:
> >
> > > Great. +1 from me.
> > >
> > >
> > > On Tue, May 27, 2014 at 7:39 PM, David McLaughlin <
> david@dmclaughlin.com
> > > >wrote:
> > >
> > > > Pagination would be a no-op to the client because it would be opt-in
> > > only,
> > > > so it would continue to fetch all the tasks in one request.
> > > >
> > > > But you raise a good point in that presumably the client is also
> going
> > to
> > > > be blocked for several seconds while executing getTasksStatus for
> large
> > > > jobs. Making the response more lightweight could be a big win there,
> > but
> > > I
> > > > would need a better understanding of how the client is using those
> > > > responses first.
> > > >
> > > >
> > > > On Tue, May 27, 2014 at 3:34 PM, Mark Chu-Carroll <
> > > mchucarroll@apache.org
> > > > >wrote:
> > > >
> > > > > Interestingly, when we first expanded getTasksStatus, I didn't like
> > the
> > > > > idea, because I thought it would have exactly this problem! It's
a
> > > *lot*
> > > > of
> > > > > information to get in a single burst.
> > > > >
> > > > > Have you checked what effect it'll have on the command-line client?
> > In
> > > > > general, the command-line has the context do a single API call,
> > gathers
> > > > the
> > > > > results, and returns to a command implementation. It'll definitely
> > > > > complicate things to add pagination.  How much of an effect will
it
> > be?
> > > > >
> > > > >    -Mark
> > > > >
> > > > >
> > > > >
> > > > > On Tue, May 27, 2014 at 5:32 PM, David McLaughlin <
> > > david@dmclaughlin.com
> > > > > >wrote:
> > > > >
> > > > > > As outlined in AURORA-458, using the new jobs page with a large
> > (but
> > > > > > reasonable) number of active and complete tasks can take a long
> > > time[1]
> > > > > to
> > > > > > render. Performance profiling done as part of AURORA-471 shows
> that
> > > the
> > > > > > main factor in response time is rendering and returning the
size
> of
> > > the
> > > > > > uncompressed payload to the client.
> > > > > >
> > > > > > To that end, I think we have two approaches:
> > > > > >
> > > > > >  1) Add pagination to the getTasksStatus call.
> > > > > >  2) Make the getTasksStatus response more lightweight.
> > > > > >
> > > > > >
> > > > > > Pagination
> > > > > > ---------------
> > > > > >
> > > > > > Pagination would be the simplest approach, and would scale to
> > > > arbitrarily
> > > > > > large numbers of tasks moving forward. The main issue with this
> is
> > > that
> > > > > we
> > > > > > need all active tasks to build the configuration summary at
the
> top
> > > of
> > > > > the
> > > > > > job page.
> > > > > >
> > > > > > As a workaround we could add a new API call -
> getTaskConfigSummary
> > -
> > > > > which
> > > > > > returns something like:
> > > > > >
> > > > > >
> > > > > > struct ConfigGroup {
> > > > > >   1: TaskConfig config
> > > > > >   2: set<i32> instanceIds
> > > > > > }
> > > > > >
> > > > > > struct ConfigSummary {
> > > > > >   1: JobKey jobKey
> > > > > >   2: set<ConfigGroup> groups
> > > > > > }
> > > > > >
> > > > > >
> > > > > > To support pagination without breaking the existing API, we
could
> > add
> > > > > > offset and limit fields to the TaskQuery struct.
> > > > > >
> > > > > >
> > > > > > Make getTasksStatus more lightweight
> > > > > > ------------------------------------
> > > > > >
> > > > > > getTasksStatus currently returns a list of ScheduledTask
> instances.
> > > The
> > > > > > biggest (in terms of payload size) child object of a
> ScheduledTask
> > is
> > > > the
> > > > > > TaskConfig struct, which itself contains an ExecutorConfig.
> > > > > >
> > > > > > I took a sample response from one of our internal production
> > > instances
> > > > > and
> > > > > > it turns out that around 65% of the total response size was
for
> > > > > > ExecutorConfig objects, and specifically the "cmdline" property
> of
> > > > these.
> > > > > > We currently do not use this information anywhere in the UI
nor
> do
> > we
> > > > > > inspect it when grouping taskConfigs, and it would be a
> relatively
> > > easy
> > > > > win
> > > > > > to just drop these from the response.
> > > > > >
> > > > > > We'd still need this information for the config grouping, so
we
> > could
> > > > add
> > > > > > the response suggested for getTaskConfigSummary as another
> property
> > > and
> > > > > > allow the client to reconcile these objects if it needs to:
> > > > > >
> > > > > >
> > > > > > struct TaskStatusResponse {
> > > > > >   1: list<LightweightTask> tasks
> > > > > >   2: set<ConfigGroup> configSummary
> > > > > > }
> > > > > >
> > > > > >
> > > > > > This would significantly reduce the uncompressed payload size
> while
> > > > still
> > > > > > containing the same data.
> > > > > >
> > > > > > However, there is still a potentially significant part of a
> payload
> > > > size
> > > > > > remaining: task events (and these *are* currently used in the
> UI).
> > We
> > > > > could
> > > > > > solve this by dropping task events from the LightweightTask
> struct
> > > too,
> > > > > and
> > > > > > fetching them lazily in batches.
> > > > > >
> > > > > > i.e. an API call like:
> > > > > >
> > > > > >
> > > > > > getTaskEvents(1: JobKey key, 2: set<i32> instanceIds)
> > > > > >
> > > > > >
> > > > > > Could return:
> > > > > >
> > > > > >
> > > > > > struct TaskEventResult {
> > > > > >   1: i32 instanceId
> > > > > >   2: list<TaskEvent> taskEvents
> > > > > > }
> > > > > >
> > > > > > struct TaskEventResponse {
> > > > > >   1: JobKey key
> > > > > >   2: list<TaskEventResult> results
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Events could then only be fetched and rendered as the user clicks
> > > > through
> > > > > > the pages of tasks.
> > > > > >
> > > > > >
> > > > > > Proposal
> > > > > > -------------
> > > > > >
> > > > > > I think pagination makes more sense here. It adds moderate
> overhead
> > > to
> > > > > the
> > > > > > complexity of the UI (this is purely due to our use of
> smart-table
> > > > which
> > > > > is
> > > > > > not so server-side pagination friendly) but the client logic
> would
> > > > > actually
> > > > > > be simpler with the new getTaskConfigSummary api call.
> > > > > >
> > > > > > I do think there is value in considering whether the
> ScheduledTask
> > > > struct
> > > > > > needs to contain all of the information it does - but this could
> be
> > > > done
> > > > > as
> > > > > > part of a separate or complimentary performance improvement
> ticket.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > [1] - At Twitter we observed 2000 active + 100 finished tasks
> > having
> > > a
> > > > > > payload size of 10MB which took 8~10 seconds to complete.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message