Return-Path: X-Original-To: apmail-spark-dev-archive@minotaur.apache.org Delivered-To: apmail-spark-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 87B1F10F83 for ; Wed, 5 Mar 2014 00:55:38 +0000 (UTC) Received: (qmail 20367 invoked by uid 500); 5 Mar 2014 00:55:37 -0000 Delivered-To: apmail-spark-dev-archive@spark.apache.org Received: (qmail 20284 invoked by uid 500); 5 Mar 2014 00:55:37 -0000 Mailing-List: contact dev-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@spark.apache.org Delivered-To: mailing list dev@spark.apache.org Received: (qmail 20276 invoked by uid 99); 5 Mar 2014 00:55:37 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 05 Mar 2014 00:55:37 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy includes SPF record at spf.trusted-forwarder.org) Received: from [209.85.192.41] (HELO mail-qg0-f41.google.com) (209.85.192.41) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 05 Mar 2014 00:55:32 +0000 Received: by mail-qg0-f41.google.com with SMTP id i50so959670qgf.0 for ; Tue, 04 Mar 2014 16:55:11 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=R4K6l9tK+141A4KBZjF53uqPHjfkVz4hnRJQi9zh/3o=; b=V9DwSHrWb/H8fJAW63sTGUB1HZsblk4m1cOnc8Ufiia8DhRT0epeS3z8heIAYn58zj 14BCeEYfyMnpkA1LrehtwcG2bURuPVU8XkFQNsP4pfhFCS4k7c1z3IhhnogTFtQhvlb4 3NzWvCModvw7Hg9Q/LAr2GwKuiFRQEsSmOFwG53twyTwrRRcrP5Xm1HrAVdSdPjFvMeZ taBwoslN0HSwt471L0mKZPYQ3Udp3cPpHbDIPEuSxJxfNXisHE8f+bfxzJRefn71Qxgg K5tCg1FLfGQtylTQcxWji2IspSYGP1EMiXvG8e7hGghgoT1SG2Ls77d3rdz6/bJBJH7H Ld6A== X-Gm-Message-State: ALoCoQmXG8EaLS88TxoKs6tT1fVEl9DI+H/nh5ECi3PytnLemlRZ1mTXGnwr/36ntRp+NNNnwFPE MIME-Version: 1.0 X-Received: by 10.229.171.8 with SMTP id f8mr3415867qcz.13.1393980911077; Tue, 04 Mar 2014 16:55:11 -0800 (PST) Received: by 10.96.91.71 with HTTP; Tue, 4 Mar 2014 16:55:11 -0800 (PST) In-Reply-To: References: Date: Tue, 4 Mar 2014 16:55:11 -0800 Message-ID: Subject: Re: a question about fault recovery From: Reynold Xin To: dev@spark.apache.org Content-Type: multipart/alternative; boundary=001a11c3a98407604f04f3d1791b X-Virus-Checked: Checked by ClamAV on apache.org --001a11c3a98407604f04f3d1791b Content-Type: text/plain; charset=ISO-8859-1 BlockManager is only responsible for in-memory/on-disk storage. It has nothing to do with re-computation. All the recomputation / retry code are done in the DAGScheduler. Note that when a node crashes, due to lazy evaluation, there is no task that needs to be re-run. Those tasks are re-run only when their outputs are needed for another task/job. On Tue, Mar 4, 2014 at 4:51 PM, dachuan wrote: > Hello, developers, > > I am just curious about the following two things which seems to be > contradictory to each other, please help me find out my understanding > mistakes: > > 1) Excerpted from sosp 2013 paper, "Then, when a node fails, the system > detects all missing RDD partitions and launches tasks to recompute them > from the last checkpoint. Many tasks can be launched at the same time to > compute different RDD partitions, allowing the whole cluster to partake in > recovery." > > 2) Excerpted from code, this function is called when there's one dead > BlockManager, this function didn't launch tasks to recover lost partitions, > instead, it updated many meta-data. > > private def removeBlockManager(blockManagerId: BlockManagerId) { > val info = blockManagerInfo(blockManagerId) > > // Remove the block manager from blockManagerIdByExecutor. > blockManagerIdByExecutor -= blockManagerId.executorId > > // Remove it from blockManagerInfo and remove all the blocks. > blockManagerInfo.remove(blockManagerId) > val iterator = info.blocks.keySet.iterator > while (iterator.hasNext) { > val blockId = iterator.next > val locations = blockLocations.get(blockId) > locations -= blockManagerId > if (locations.size == 0) { > blockLocations.remove(locations) > } > } > } > > thanks, > dachuan. > > -- > Dachuan Huang > Cellphone: 614-390-7234 > 2015 Neil Avenue > Ohio State University > Columbus, Ohio > U.S.A. > 43210 > --001a11c3a98407604f04f3d1791b--