From: John Yang
Date: Wed, 22 Jun 2016 23:38:03 -0600
Subject: Re: Question about failure scenarios
To: dev@reef.apache.org
References: <5f4ae285-bbbc-6050-9931-93185fd68a60@weimo.de>

Julia,

With regard to REEF Task submission, here is how Vortex goes about it.

- Upon AllocatedEvaluatorHandler, Vortex submits a REEF Task without any bookkeeping. (Note that Vortex's REEF Task does nothing useful upon instantiation; it only waits for the Driver to schedule tasklets onto it.)
- *(If FailedEvaluatorHandler is invoked here)* We request a new Evaluator. FailedEvaluator#getFailedTask returns null and is thus ignored.
- RunningTaskHandler is invoked and we memorize the task's id. (At this point, Vortex considers the Evaluator ready to have tasklets scheduled onto it.)
- *(If FailedEvaluatorHandler is invoked here)* We request a new Evaluator and obtain the task id from FailedEvaluator#getFailedTask to un-memorize it.

Note that we rely on the fact that the Driver uses a dedicated thread per Evaluator for processing its incoming events, which guarantees that the invocations of its handlers are serialized. Because of this, I believe RunningTaskHandler is never invoked after FailedEvaluatorHandler for the same Evaluator, and thus no garbage task ids are created.

Thanks,
John

On Wed, Jun 22, 2016 at 10:42 PM, John Yang wrote:

> Hi,
>
> Things are the same with the Mesos runtime. It receives resource offers
> from Mesos. If the offers do not satisfy the resource requests from the
> REEF user, it declines the offers in order to receive new offers. It keeps
> at it until it gets the right offers, which might never come.
>
> The best solution I can think of right now is similar to Markus's. You
> have to somehow identify this behavior at the REEF-user-level based on your
> application semantics. If we were to have this functionality at the RM
> runtime-level (reef-runtime-yarn, reef-runtime-mesos, etc.), I guess we can
> make a set of standard configuration parameters to be used in each runtime.
> However, I'm not sure the benefits outweigh the added complexity, since the
> resource capacity/availability of a cluster is usually considered
> non-deterministic.
>
> Thanks,
> John
>
> On Wed, Jun 22, 2016 at 2:58 PM, Markus Weimer wrote:
>
>> On 2016-06-22 13:31, Tobin Baker wrote:
>>
>>> My experience on the Java side has been that if insufficient
>>> resources are available to allocate all requested Evaluators, the RM
>>> will just silently keep retrying forever, with no exceptions thrown
>>> by YARN or REEF.
>>
>> That is correct. YARN does not, presently, give "no" as an answer to an
>> unsatisfiable resource request. The only way I know to guard against it
>> is to set a timer and to give up if the needed containers can't be
>> acquired within the timeout.
>>
>> Markus
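
The handler sequence described at the top of this message can be modelled roughly as below. This is only a sketch under stated assumptions: `TaskBookkeepingSketch`, its method names, and the `"task-" + evaluatorId` id scheme are hypothetical stand-ins written in plain Java, not REEF or Vortex classes. A null `failedTaskId` plays the role of the "no failed task" case mentioned above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical model of the bookkeeping described in this thread.
 * Plain Java stand-ins only; none of these names come from REEF or Vortex.
 */
final class TaskBookkeepingSketch {
  // Ids of tasks whose Evaluators are considered ready for tasklets.
  // Events are serialized per Evaluator only, so state shared across
  // Evaluators still has to be thread-safe.
  private final Set<String> runningTaskIds = ConcurrentHashMap.newKeySet();
  // Number of replacement Evaluators requested so far.
  private final AtomicInteger replacementRequests = new AtomicInteger();

  /** Models the allocated-Evaluator step: submit a task, record nothing. */
  String onEvaluatorAllocated(final String evaluatorId) {
    return "task-" + evaluatorId; // assumed id scheme, for illustration only
  }

  /** Models the running-task step: memorize the task id. */
  void onTaskRunning(final String taskId) {
    runningTaskIds.add(taskId);
  }

  /**
   * Models the failed-Evaluator step. failedTaskId is null when the
   * Evaluator failed before its task was reported as running.
   */
  void onEvaluatorFailed(final String failedTaskId) {
    replacementRequests.incrementAndGet(); // always ask for a new Evaluator
    if (failedTaskId != null) {
      runningTaskIds.remove(failedTaskId); // un-memorize the task id
    }
  }

  boolean isReady(final String taskId) {
    return runningTaskIds.contains(taskId);
  }

  int readyCount() {
    return runningTaskIds.size();
  }

  int replacementRequests() {
    return replacementRequests.get();
  }
}
```

Because nothing is recorded until the running-task step, a failure before that step leaves no state to clean up, which is why the null case can simply be ignored.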
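
Markus's suggestion of setting a timer and giving up could look something like the following at the application level. It is a minimal sketch, not a REEF facility: `AllocationTimeoutSketch` and its methods are hypothetical, and the application is assumed to call `onContainerAcquired()` from whatever callback reports a newly allocated container.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical application-level timeout guard; not part of REEF.
 */
final class AllocationTimeoutSketch {
  // Counts down once per container that has arrived.
  private final CountDownLatch remaining;

  AllocationTimeoutSketch(final int requestedContainers) {
    this.remaining = new CountDownLatch(requestedContainers);
  }

  /** Record that one requested container has arrived. */
  void onContainerAcquired() {
    remaining.countDown();
  }

  /**
   * Wait until every requested container has arrived or the timeout expires.
   * Returns true if all arrived in time, false if the caller should give up.
   */
  boolean awaitAll(final long timeout, final TimeUnit unit) {
    try {
      return remaining.await(timeout, unit);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
      return false;
    }
  }
}
```

What "giving up" means when `awaitAll` returns false is left to the application, in line with the point above that this has to be decided from application semantics.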