aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maxim Khutornenko <ma...@apache.org>
Subject Re: Proposal: API changes to getTasksStatus
Date Thu, 19 Jun 2014 20:26:39 GMT
Thanks for chiming in. I played with this idea a bit more and I tend to
agree that thrift lookup filter may be an overkill here. TBase does not
provide a way to iterate over all available fields, which makes this design
even harder to implement.

Also, the complexity/gain tradeoff is not ideal here either. The heaviest
piece of data is TaskConfig.executorConfig. The second biggest is
TaskEvents but both UI and SLA commands need them. Everything else is more
of a noise in terms of data size. So, I guess the lightweight API (no
executorConfig) proposed originally might be the easiest solution that
should mostly satisfy both current use cases. It would also be the easiest
to throw away when/if we eventually get rid of the ExecutorConfig data
stored within TaskConfig object.



On Thu, Jun 19, 2014 at 12:49 PM, Bill Farner <wfarner@apache.org> wrote:

> My personal preference would be to identify the different use cases, and
> create API calls for those (i suspect there are relatively few).  I'm not
> fond of altering the thrift response schema on the fly based on request
> parameters (e.g. SELECT x FROM).  This makes for client and server code
> that is difficult to reason about and set preconditions around.
>
> We might also consider taking a more REST-like approach and using object
> references (e.g. rather than returning an embedded TaskConfig object,
> return instructions for fetching the TaskConfig).  This would allow the
> client to de-dupe and selectively fetch.
>
>
> -=Bill
>
>
> On Wed, Jun 18, 2014 at 5:10 PM, Maxim Khutornenko <maxim@apache.org>
> wrote:
>
> > Small correction. TBase does not support lookup by field names, so the
> > format would have to be a list of field IDs:
> >
> > // Defines a set of thrift paths to be populated when returning
> > ScheduledTasks.
> > // Single path format is a list of valid field IDs representing thrift
> > query path.
> > // Valid examples:
> > //    [4], - taskEvents
> > //    [1,6] - assignedTask.instanceId
> > //    [1,4,27] - assignedTask.task.metadata
> > // Invalid examples:
> > //    [10] - field does not exist.
> > //    [4,2] - collection value lookup is not supported.
> > struct ScheduledTaskFormat {
> >   1: set<list<i32>> fieldIds
> > }
> >
> > The client would have to support a name-to-id conversion if we still
> want a
> > readable format like "assignedTask.instanceId" (converts to [1,6]).
> >
> >
> > On Wed, Jun 18, 2014 at 1:21 PM, Maxim Khutornenko <maxim@apache.org>
> > wrote:
> >
> > > Bumping up this thread as pagination does not solve client cases where
> > all
> > > tasks must be pulled for evaluation. Example: SLA commands in
> > aurora_admin.
> > >
> > > I would like to explore the #2 idea proposed by David and subsequently
> > > outlined by Bill/Suman.
> > >
> > > Specifically, I am proposing to add an optional format:
> > > ScheduledTaskFormat parameter into the existing getTasksStatus() RPC:
> > >
> > > // Defines a set of fields to be populated
> > > // when returning ScheduledTasks. A format is
> > > // a string composed of valid struct field names
> > > // of an arbitrary depth separated by '.'
> > > // Valid examples:
> > > //    "taskEvents"
> > > //    "assignedTask.instanceId"
> > > //    "assignedTask.task.metadata"
> > > // Invalid examples:
> > > //    "task" - field does not exist.
> > > //    "assignedTask.assignedPorts.health" -
> > > //       collection value lookup is not supported.
> > > struct ScheduledTaskFormat {
> > >   1: set<string> fields
> > > }
> > >
> > > Given the above structure, the scheduler would be able to return
> > > ScheduledTask instances populated according to the desired format with
> > all
> > > other fields set to NULL.
> > >
> > > Alternatively, we could invert the format meaning from "include" to
> > > "exclude" and use it as an exclusion filter. However, I am leaning
> > towards
> > > the "include" approach as a better one from perf standpoint.
> > >
> > > There is also a possibility of defining a new RPC to support this
> > > filtering but I am not convinced it's worth the effort.
> > >
> > > Any comments/suggestions?
> > >
> > > Thanks,
> > > Maxim
> > >
> > >
> > > On Fri, May 30, 2014 at 3:43 PM, David McLaughlin <
> david@dmclaughlin.com
> > >
> > > wrote:
> > >
> > >> On Fri, May 30, 2014 at 11:22 AM, Suman Karumuri <mansu@apache.org>
> > >> wrote:
> > >>
> > >> > Looks like my earlier mail has bounced. Sending it again.
> > >> >
> > >> > Can you please add some explanation to AURORA-471 explaining where
> the
> > >> time
> > >> > is being spent with some data and some micro benchmarks. As is it
is
> > >> > unclear why thrift.flush should be taking so long. Is the default
> > >> > implementation of TServlet.doPost() doing something that can be
> > >> > optimized?If thrift serialization is really quick, where are we
> > spending
> > >> > the time? Or is thrift.js spending too much time parsing 10 megs of
> > >> JSON?
> > >> >
> > >>
> > >> I updated AURORA-458 with the jvisualvm profile which show that most
> > time
> > >> is spent in GzipStream.write. Bill and I had profiled like this
> before,
> > >> but
> > >> hadn't seen GzipStream be such an obvious cause of latency. When I
> also
> > >> tried to do some experiments with curl and removing compression, I
> > didn't
> > >> see any effect in latency so ruled out Gzip as a factor. Turns out
> this
> > is
> > >> because I was removing the Accept-Encoding header but not removing the
> > >> --compressed flag (which effectively sets the same header). When
> > removing
> > >> both, it becomes a lot clearer of the overhead of compressing
> responses
> > on
> > >> the fly (see ticket for details).
> > >>
> > >> I still think the uncompressed 10MB is going to be too slow though,
> > >> especially for clients with high latency to the scheduler host.
> > >>
> > >>
> > >>
> > >> > Pagination is a good idea. The only downside is that users would
> loose
> > >> the
> > >> > ability to sort and filter on the UI (this feature is to be
> > >> implemented).
> > >> > While pagination improves the time it takes to load a page, it
> doesn't
> > >> > necessarily improve the time it takes for a user to achieve his
> goal.
> > >> >
> > >>
> > >> Yeah, we would need to add sorting and filtering to the query.
> > >>
> > >>
> > >> >
> > >> > For this reason, I feel that we should drop the extra information
> from
> > >> the
> > >> > getTasksStatus first (if the client is unaffected by the change) to
> > >> reduce
> > >> > the page load time and see how much it improves the performance. 
To
> > >> > improve perceived performance, we can reduce the need to reload the
> > job
> > >> > page (by loading tasks in the back ground and client side caching
> > using
> > >> > $cache), since the only time the user will notice the delay is when
> > >> loading
> > >> > the page which is ok if it's few seconds after throwing out the
> > executor
> > >> > config.
> > >> >
> > >> >
> > >> Hmm, I don't see how $cache gets us anything. The user will want the
> > >> latest
> > >> information every time they hit the page.
> > >>
> > >>
> > >> > I think we should delay pagination as long as possible since
> operating
> > >> on
> > >> > the all the tasks in the UI with sorting, filtering and
> visualization
> > >> adds
> > >> > a lot of powerful capabilities. We can also add filtering and
> > searching
> > >> > capabilities in the client with a paginated API as well. But I feel
> > >> > delaying pagination would result in a quicker win instead since the
> > API
> > >> can
> > >> > be changed in a backwards compatible manner.
> > >> >
> > >> >
> > >> Pagination will be optional and totally backwards compatible, so they
> > can
> > >> still get summaries of all tasks by not supplying an offset or limit
> to
> > >> their query. If you drop information from the response however, now
> the
> > >> client can't do that at all.
> > >>
> > >>
> > >> >
> > >> > On Wed, May 28, 2014 at 4:42 AM, Mark Chu-Carroll <
> > >> mchucarroll@apache.org>
> > >> > wrote:
> > >> >
> > >> > > Great. +1 from me.
> > >> > >
> > >> > >
> > >> > > On Tue, May 27, 2014 at 7:39 PM, David McLaughlin <
> > >> david@dmclaughlin.com
> > >> > > >wrote:
> > >> > >
> > >> > > > Pagination would be a no-op to the client because it would
be
> > opt-in
> > >> > > only,
> > >> > > > so it would continue to fetch all the tasks in one request.
> > >> > > >
> > >> > > > But you raise a good point in that presumably the client
is also
> > >> going
> > >> > to
> > >> > > > be blocked for several seconds while executing getTasksStatus
> for
> > >> large
> > >> > > > jobs. Making the response more lightweight could be a big
win
> > there,
> > >> > but
> > >> > > I
> > >> > > > would need a better understanding of how the client is using
> those
> > >> > > > responses first.
> > >> > > >
> > >> > > >
> > >> > > > On Tue, May 27, 2014 at 3:34 PM, Mark Chu-Carroll <
> > >> > > mchucarroll@apache.org
> > >> > > > >wrote:
> > >> > > >
> > >> > > > > Interestingly, when we first expanded getTasksStatus,
I didn't
> > >> like
> > >> > the
> > >> > > > > idea, because I thought it would have exactly this
problem!
> > It's a
> > >> > > *lot*
> > >> > > > of
> > >> > > > > information to get in a single burst.
> > >> > > > >
> > >> > > > > Have you checked what effect it'll have on the command-line
> > >> client?
> > >> > In
> > >> > > > > general, the command-line has the context do a single
API
> call,
> > >> > gathers
> > >> > > > the
> > >> > > > > results, and returns to a command implementation. It'll
> > definitely
> > >> > > > > complicate things to add pagination.  How much of an
effect
> will
> > >> it
> > >> > be?
> > >> > > > >
> > >> > > > >    -Mark
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Tue, May 27, 2014 at 5:32 PM, David McLaughlin <
> > >> > > david@dmclaughlin.com
> > >> > > > > >wrote:
> > >> > > > >
> > >> > > > > > As outlined in AURORA-458, using the new jobs
page with a
> > large
> > >> > (but
> > >> > > > > > reasonable) number of active and complete tasks
can take a
> > long
> > >> > > time[1]
> > >> > > > > to
> > >> > > > > > render. Performance profiling done as part of
AURORA-471
> shows
> > >> that
> > >> > > the
> > >> > > > > > main factor in response time is rendering and
returning the
> > >> size of
> > >> > > the
> > >> > > > > > uncompressed payload to the client.
> > >> > > > > >
> > >> > > > > > To that end, I think we have two approaches:
> > >> > > > > >
> > >> > > > > >  1) Add pagination to the getTasksStatus call.
> > >> > > > > >  2) Make the getTasksStatus response more lightweight.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Pagination
> > >> > > > > > ---------------
> > >> > > > > >
> > >> > > > > > Pagination would be the simplest approach, and
would scale
> to
> > >> > > > arbitrarily
> > >> > > > > > large numbers of tasks moving forward. The main
issue with
> > this
> > >> is
> > >> > > that
> > >> > > > > we
> > >> > > > > > need all active tasks to build the configuration
summary at
> > the
> > >> top
> > >> > > of
> > >> > > > > the
> > >> > > > > > job page.
> > >> > > > > >
> > >> > > > > > As a workaround we could add a new API call -
> > >> getTaskConfigSummary
> > >> > -
> > >> > > > > which
> > >> > > > > > returns something like:
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > struct ConfigGroup {
> > >> > > > > >   1: TaskConfig config
> > >> > > > > >   2: set<i32> instanceIds
> > >> > > > > > }
> > >> > > > > >
> > >> > > > > > struct ConfigSummary {
> > >> > > > > >   1: JobKey jobKey
> > >> > > > > >   2: set<ConfigGroup> groups
> > >> > > > > > }
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > To support pagination without breaking the existing
API, we
> > >> could
> > >> > add
> > >> > > > > > offset and limit fields to the TaskQuery struct.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Make getTasksStatus more lightweight
> > >> > > > > > ------------------------------------
> > >> > > > > >
> > >> > > > > > getTasksStatus currently returns a list of ScheduledTask
> > >> instances.
> > >> > > The
> > >> > > > > > biggest (in terms of payload size) child object
of a
> > >> ScheduledTask
> > >> > is
> > >> > > > the
> > >> > > > > > TaskConfig struct, which itself contains an ExecutorConfig.
> > >> > > > > >
> > >> > > > > > I took a sample response from one of our internal
production
> > >> > > instances
> > >> > > > > and
> > >> > > > > > it turns out that around 65% of the total response
size was
> > for
> > >> > > > > > ExecutorConfig objects, and specifically the "cmdline"
> > property
> > >> of
> > >> > > > these.
> > >> > > > > > We currently do not use this information anywhere
in the UI
> > nor
> > >> do
> > >> > we
> > >> > > > > > inspect it when grouping taskConfigs, and it would
be a
> > >> relatively
> > >> > > easy
> > >> > > > > win
> > >> > > > > > to just drop these from the response.
> > >> > > > > >
> > >> > > > > > We'd still need this information for the config
grouping, so
> > we
> > >> > could
> > >> > > > add
> > >> > > > > > the response suggested for getTaskConfigSummary
as another
> > >> property
> > >> > > and
> > >> > > > > > allow the client to reconcile these objects if
it needs to:
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > struct TaskStatusResponse {
> > >> > > > > >   1: list<LightweightTask> tasks
> > >> > > > > >   2: set<ConfigGroup> configSummary
> > >> > > > > > }
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > This would significantly reduce the uncompressed
payload
> size
> > >> while
> > >> > > > still
> > >> > > > > > containing the same data.
> > >> > > > > >
> > >> > > > > > However, there is still a potentially significant
part of a
> > >> payload
> > >> > > > size
> > >> > > > > > remaining: task events (and these *are* currently
used in
> the
> > >> UI).
> > >> > We
> > >> > > > > could
> > >> > > > > > solve this by dropping task events from the LightweightTask
> > >> struct
> > >> > > too,
> > >> > > > > and
> > >> > > > > > fetching them lazily in batches.
> > >> > > > > >
> > >> > > > > > i.e. an API call like:
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > getTaskEvents(1: JobKey key, 2: set<i32>
instanceIds)
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Could return:
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > struct TaskEventResult {
> > >> > > > > >   1: i32 instanceId
> > >> > > > > >   2: list<TaskEvent> taskEvents
> > >> > > > > > }
> > >> > > > > >
> > >> > > > > > struct TaskEventResponse {
> > >> > > > > >   1: JobKey key
> > >> > > > > >   2: list<TaskEventResult> results
> > >> > > > > > }
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Events could then only be fetched and rendered
as the user
> > >> clicks
> > >> > > > through
> > >> > > > > > the pages of tasks.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Proposal
> > >> > > > > > -------------
> > >> > > > > >
> > >> > > > > > I think pagination makes more sense here. It adds
moderate
> > >> overhead
> > >> > > to
> > >> > > > > the
> > >> > > > > > complexity of the UI (this is purely due to our
use of
> > >> smart-table
> > >> > > > which
> > >> > > > > is
> > >> > > > > > not so server-side pagination friendly) but the
client logic
> > >> would
> > >> > > > > actually
> > >> > > > > > be simpler with the new getTaskConfigSummary api
call.
> > >> > > > > >
> > >> > > > > > I do think there is value in considering whether
the
> > >> ScheduledTask
> > >> > > > struct
> > >> > > > > > needs to contain all of the information it does
- but this
> > >> could be
> > >> > > > done
> > >> > > > > as
> > >> > > > > > part of a separate or complimentary performance
improvement
> > >> ticket.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > [1] - At Twitter we observed 2000 active + 100
finished
> tasks
> > >> > having
> > >> > > a
> > >> > > > > > payload size of 10MB which took 8~10 seconds to
complete.
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message