Date: Mon, 15 Sep 2014 15:21:53 -0700
Subject: Re: CoHadoop Papers
From: Colin McCabe
To: Gary Malouf
Cc: "dev@spark.apache.org"

This feature is called "block affinity groups" and it's been under
discussion for a while, but isn't fully implemented yet. HDFS-2576 is
not a complete solution because it doesn't change the way the balancer
works, just the initial placement of blocks. Once heterogeneous storage
management (HDFS-2832) is implemented, you will be able to get a similar
effect through using separate storages, at the cost of fragmenting the
backing store somewhat.

Of course, "co-locating related data blocks" is often bad, not good,
because it reduces the amount of parallelism a single job can exploit,
and can increase the chance of losing an entire dataset due to node
failures. That's one reason why the current semi-random placement
strategy has lasted so long. In other words, this is workload-dependent.

best,
Colin

On Tue, Aug 26, 2014 at 5:20 AM, Gary Malouf wrote:
> It appears support for this type of control over block placement is going
> out in the next version of HDFS:
> https://issues.apache.org/jira/browse/HDFS-2576
>
>
> On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf wrote:
>
>> One of my colleagues has been questioning me as to why Spark/HDFS makes no
>> attempts to try to co-locate related data blocks. He pointed to this
>> paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
>> CoHadoop research and the performance improvements it yielded for
>> Map/Reduce jobs.
>>
>> Would leveraging these ideas for writing data from Spark make sense/be
>> worthwhile?
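
The trade-off described in the reply above (co-location limits parallelism and concentrates failure risk) can be seen in a toy placement model. This is only a sketch under stated assumptions: the function names, the cluster size, and the uniform-random placement are hypothetical and chosen for illustration; real HDFS placement is rack-aware and is not modeled here.

```python
import random


def place_blocks(num_blocks, num_nodes, replication, colocate, rng):
    """Return one frozenset of node ids (the replica set) per block.

    colocate=True puts every block on the same `replication` nodes;
    colocate=False picks an independent random replica set per block.
    Both are simplifications, not HDFS's actual policy.
    """
    if colocate:
        group = frozenset(rng.sample(range(num_nodes), replication))
        return [group] * num_blocks
    return [
        frozenset(rng.sample(range(num_nodes), replication))
        for _ in range(num_blocks)
    ]


def nodes_with_local_data(placement):
    """Number of distinct nodes that can run a data-local task."""
    return len(set().union(*placement))


def fraction_lost(placement, failed_nodes):
    """Fraction of blocks whose replicas all sit on failed nodes."""
    lost = sum(1 for replicas in placement if replicas <= failed_nodes)
    return lost / len(placement)


rng = random.Random(0)
colocated = place_blocks(100, 50, 3, colocate=True, rng=rng)
spread = place_blocks(100, 50, 3, colocate=False, rng=rng)

# Fail exactly the nodes that hold the first block in each layout.
print("co-located: nodes with local data =", nodes_with_local_data(colocated),
      "fraction lost =", fraction_lost(colocated, set(colocated[0])))
print("spread:     nodes with local data =", nodes_with_local_data(spread),
      "fraction lost =", fraction_lost(spread, set(spread[0])))
```

In this model the co-located dataset is readable locally from only three nodes, and losing those three nodes loses every block; the spread layout offers many more nodes for local tasks and loses only the blocks whose replica sets fall entirely within the failed nodes. Whether that matters more than the join or scan locality gained from co-location depends on the workload.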