Return-Path: X-Original-To: apmail-aurora-dev-archive@minotaur.apache.org Delivered-To: apmail-aurora-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 890401025A for ; Fri, 30 May 2014 18:28:19 +0000 (UTC) Received: (qmail 49777 invoked by uid 500); 30 May 2014 18:28:19 -0000 Delivered-To: apmail-aurora-dev-archive@aurora.apache.org Received: (qmail 49724 invoked by uid 500); 30 May 2014 18:28:19 -0000 Mailing-List: contact dev-help@aurora.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@aurora.incubator.apache.org Delivered-To: mailing list dev@aurora.incubator.apache.org Received: (qmail 49716 invoked by uid 99); 30 May 2014 18:28:19 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 30 May 2014 18:28:19 +0000 X-ASF-Spam-Status: No, hits=-1998.5 required=5.0 tests=ALL_TRUSTED,HTML_MESSAGE,RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.3] (HELO mail.apache.org) (140.211.11.3) by apache.org (qpsmtpd/0.29) with SMTP; Fri, 30 May 2014 18:28:18 +0000 Received: (qmail 49542 invoked by uid 99); 30 May 2014 18:27:52 -0000 Received: from minotaur.apache.org (HELO minotaur.apache.org) (140.211.11.9) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 30 May 2014 18:27:52 +0000 Received: from localhost (HELO mail-vc0-f174.google.com) (127.0.0.1) (smtp-auth username wfarner, mechanism plain) by minotaur.apache.org (qpsmtpd/0.29) with ESMTP; Fri, 30 May 2014 18:27:52 +0000 Received: by mail-vc0-f174.google.com with SMTP id ik5so2337327vcb.19 for ; Fri, 30 May 2014 11:27:51 -0700 (PDT) X-Gm-Message-State: ALoCoQnlGakBkygCdqQcTeVqybgdeqQxRc5bGK3sbxKZzr6xOXtueY8vAQCRN1Se8zq2yuXcPZQ+ MIME-Version: 1.0 X-Received: by 10.52.35.46 with SMTP id e14mr2067722vdj.94.1401474471415; Fri, 30 May 2014 11:27:51 -0700 (PDT) Received: by 10.58.59.101 with HTTP; Fri, 30 May 2014 11:27:51 -0700 (PDT) In-Reply-To: References: Date: Fri, 30 May 2014 11:27:51 -0700 Message-ID: Subject: Re: Proposal: API changes to getTasksStatus From: Bill Farner To: "dev@aurora.incubator.apache.org" Content-Type: multipart/alternative; boundary=20cf30780cb007ebef04faa2347c X-Virus-Checked: Checked by ClamAV on apache.org --20cf30780cb007ebef04faa2347c Content-Type: text/plain; charset=UTF-8 > > I feel that we should drop the extra information from the > getTasksStatus first (if the client is unaffected by the change) The client would indeed be affected by that (config diffing when performing job updates). However, there's nothing stopping us from defining a new RPC and message to define exactly the fields we want to send for the UI. -=Bill On Fri, May 30, 2014 at 11:22 AM, Suman Karumuri wrote: > Looks like my earlier mail has bounced. Sending it again. > > Can you please add some explanation to AURORA-471 explaining where the time > is being spent with some data and some micro benchmarks. As is it is > unclear why thrift.flush should be taking so long. Is the default > implementation of TServlet.doPost() doing something that can be > optimized?If thrift serialization is really quick, where are we spending > the time? Or is thrift.js spending too much time parsing 10 megs of JSON? > > Pagination is a good idea. The only downside is that users would loose the > ability to sort and filter on the UI (this feature is to be implemented). > While pagination improves the time it takes to load a page, it doesn't > necessarily improve the time it takes for a user to achieve his goal. > > For this reason, I feel that we should drop the extra information from the > getTasksStatus first (if the client is unaffected by the change) to reduce > the page load time and see how much it improves the performance. To > improve perceived performance, we can reduce the need to reload the job > page (by loading tasks in the back ground and client side caching using > $cache), since the only time the user will notice the delay is when loading > the page which is ok if it's few seconds after throwing out the executor > config. > > I think we should delay pagination as long as possible since operating on > the all the tasks in the UI with sorting, filtering and visualization adds > a lot of powerful capabilities. We can also add filtering and searching > capabilities in the client with a paginated API as well. But I feel > delaying pagination would result in a quicker win instead since the API can > be changed in a backwards compatible manner. > > > On Wed, May 28, 2014 at 4:42 AM, Mark Chu-Carroll > wrote: > > > Great. +1 from me. > > > > > > On Tue, May 27, 2014 at 7:39 PM, David McLaughlin > >wrote: > > > > > Pagination would be a no-op to the client because it would be opt-in > > only, > > > so it would continue to fetch all the tasks in one request. > > > > > > But you raise a good point in that presumably the client is also going > to > > > be blocked for several seconds while executing getTasksStatus for large > > > jobs. Making the response more lightweight could be a big win there, > but > > I > > > would need a better understanding of how the client is using those > > > responses first. > > > > > > > > > On Tue, May 27, 2014 at 3:34 PM, Mark Chu-Carroll < > > mchucarroll@apache.org > > > >wrote: > > > > > > > Interestingly, when we first expanded getTasksStatus, I didn't like > the > > > > idea, because I thought it would have exactly this problem! It's a > > *lot* > > > of > > > > information to get in a single burst. > > > > > > > > Have you checked what effect it'll have on the command-line client? > In > > > > general, the command-line has the context do a single API call, > gathers > > > the > > > > results, and returns to a command implementation. It'll definitely > > > > complicate things to add pagination. How much of an effect will it > be? > > > > > > > > -Mark > > > > > > > > > > > > > > > > On Tue, May 27, 2014 at 5:32 PM, David McLaughlin < > > david@dmclaughlin.com > > > > >wrote: > > > > > > > > > As outlined in AURORA-458, using the new jobs page with a large > (but > > > > > reasonable) number of active and complete tasks can take a long > > time[1] > > > > to > > > > > render. Performance profiling done as part of AURORA-471 shows that > > the > > > > > main factor in response time is rendering and returning the size of > > the > > > > > uncompressed payload to the client. > > > > > > > > > > To that end, I think we have two approaches: > > > > > > > > > > 1) Add pagination to the getTasksStatus call. > > > > > 2) Make the getTasksStatus response more lightweight. > > > > > > > > > > > > > > > Pagination > > > > > --------------- > > > > > > > > > > Pagination would be the simplest approach, and would scale to > > > arbitrarily > > > > > large numbers of tasks moving forward. The main issue with this is > > that > > > > we > > > > > need all active tasks to build the configuration summary at the top > > of > > > > the > > > > > job page. > > > > > > > > > > As a workaround we could add a new API call - getTaskConfigSummary > - > > > > which > > > > > returns something like: > > > > > > > > > > > > > > > struct ConfigGroup { > > > > > 1: TaskConfig config > > > > > 2: set instanceIds > > > > > } > > > > > > > > > > struct ConfigSummary { > > > > > 1: JobKey jobKey > > > > > 2: set groups > > > > > } > > > > > > > > > > > > > > > To support pagination without breaking the existing API, we could > add > > > > > offset and limit fields to the TaskQuery struct. > > > > > > > > > > > > > > > Make getTasksStatus more lightweight > > > > > ------------------------------------ > > > > > > > > > > getTasksStatus currently returns a list of ScheduledTask instances. > > The > > > > > biggest (in terms of payload size) child object of a ScheduledTask > is > > > the > > > > > TaskConfig struct, which itself contains an ExecutorConfig. > > > > > > > > > > I took a sample response from one of our internal production > > instances > > > > and > > > > > it turns out that around 65% of the total response size was for > > > > > ExecutorConfig objects, and specifically the "cmdline" property of > > > these. > > > > > We currently do not use this information anywhere in the UI nor do > we > > > > > inspect it when grouping taskConfigs, and it would be a relatively > > easy > > > > win > > > > > to just drop these from the response. > > > > > > > > > > We'd still need this information for the config grouping, so we > could > > > add > > > > > the response suggested for getTaskConfigSummary as another property > > and > > > > > allow the client to reconcile these objects if it needs to: > > > > > > > > > > > > > > > struct TaskStatusResponse { > > > > > 1: list tasks > > > > > 2: set configSummary > > > > > } > > > > > > > > > > > > > > > This would significantly reduce the uncompressed payload size while > > > still > > > > > containing the same data. > > > > > > > > > > However, there is still a potentially significant part of a payload > > > size > > > > > remaining: task events (and these *are* currently used in the UI). > We > > > > could > > > > > solve this by dropping task events from the LightweightTask struct > > too, > > > > and > > > > > fetching them lazily in batches. > > > > > > > > > > i.e. an API call like: > > > > > > > > > > > > > > > getTaskEvents(1: JobKey key, 2: set instanceIds) > > > > > > > > > > > > > > > Could return: > > > > > > > > > > > > > > > struct TaskEventResult { > > > > > 1: i32 instanceId > > > > > 2: list taskEvents > > > > > } > > > > > > > > > > struct TaskEventResponse { > > > > > 1: JobKey key > > > > > 2: list results > > > > > } > > > > > > > > > > > > > > > Events could then only be fetched and rendered as the user clicks > > > through > > > > > the pages of tasks. > > > > > > > > > > > > > > > Proposal > > > > > ------------- > > > > > > > > > > I think pagination makes more sense here. It adds moderate overhead > > to > > > > the > > > > > complexity of the UI (this is purely due to our use of smart-table > > > which > > > > is > > > > > not so server-side pagination friendly) but the client logic would > > > > actually > > > > > be simpler with the new getTaskConfigSummary api call. > > > > > > > > > > I do think there is value in considering whether the ScheduledTask > > > struct > > > > > needs to contain all of the information it does - but this could be > > > done > > > > as > > > > > part of a separate or complimentary performance improvement ticket. > > > > > > > > > > > > > > > > > > > > > > > > > [1] - At Twitter we observed 2000 active + 100 finished tasks > having > > a > > > > > payload size of 10MB which took 8~10 seconds to complete. > > > > > > > > > > > > > > > --20cf30780cb007ebef04faa2347c--