Sender: rarecactus@gmail.com
In-Reply-To: 
 <CAGOvqio9vdiDApeV=XshexXtCnkwbBbBZPeCeAo2EaOhFNMGng@mail.gmail.com>
References: 
 <CAGOvqipJEM=biUE5W7r_f+2q5CTh_Ndnq=PE3rHfVMruBWN05g@mail.gmail.com>
	<CAGOvqio9vdiDApeV=XshexXtCnkwbBbBZPeCeAo2EaOhFNMGng@mail.gmail.com>
Date: Mon, 15 Sep 2014 15:21:53 -0700
Message-ID: 
 <CA+qbEUMCxz5usbOm4QoOkvznTotXA7mxBM2a0utkdt=eh2SeBw@mail.gmail.com>
Subject: Re: CoHadoop Papers
From: Colin McCabe <cmccabe@alumni.cmu.edu>
To: Gary Malouf <malouf.gary@gmail.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Content-Type: text/plain; charset=UTF-8

This feature is called "block affinity groups".  It has been under
discussion for a while, but isn't fully implemented yet.  HDFS-2576 is
not a complete solution because it changes only the initial placement
of blocks, not the way the balancer works, so the balancer can later
move co-located blocks apart.  Once heterogeneous storage management
(HDFS-2832) is implemented, you will be able to get a similar effect
by using separate storages, at the cost of fragmenting the backing
store somewhat.

Of course, "co-locating related data blocks" is often bad, not good,
because it reduces the amount of parallelism a single job can exploit,
and can increase the chance of losing an entire dataset due to node
failures.  That's one reason why the current semi-random placement
strategy has lasted so long.  In other words, this is
workload-dependent.
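
To make the tradeoff concrete, here is a rough back-of-the-envelope
sketch in Python.  It assumes a deliberately simplified model that is
not how HDFS actually places blocks: every block independently picks
its replica nodes uniformly at random (no rack awareness), and a fixed
number of nodes fail at the same time.  The function names and the
example numbers are made up for illustration.

```python
from math import comb


def p_block_lost(nodes, replicas, failed):
    """Probability that every replica of one block sits on a failed node.

    Assumes the block's replica set is a uniformly random choice of
    `replicas` distinct nodes out of `nodes` (a simplification).
    """
    if failed < replicas:
        return 0.0
    return comb(failed, replicas) / comb(nodes, replicas)


def p_entire_dataset_lost(nodes, replicas, failed, blocks, colocated):
    """Probability that *all* blocks of a dataset are lost at once."""
    p = p_block_lost(nodes, replicas, failed)
    # Co-located: every block shares one replica set, so they all
    # survive or die together.  Random: each block has its own
    # independently chosen replica set, so all of them must be lost.
    return p if colocated else p ** blocks


def expected_nodes_with_data(nodes, replicas, blocks, colocated):
    """Expected number of distinct nodes holding part of the dataset.

    This is a crude proxy for how many nodes can read it locally.
    """
    if colocated:
        return replicas
    # A given node misses one block with probability 1 - replicas/nodes.
    return nodes * (1 - (1 - replicas / nodes) ** blocks)


# Hypothetical cluster: 100 nodes, 3 replicas, a 50-block dataset,
# and 10 nodes failing simultaneously.
for colocated in (True, False):
    print(
        "colocated" if colocated else "random",
        p_entire_dataset_lost(100, 3, 10, 50, colocated),
        expected_nodes_with_data(100, 3, 50, colocated),
    )
```

Under these assumptions, co-location confines the dataset to as few
nodes as there are replicas and makes total loss far more likely than
with independent placement.  The same model also shows the flip side:
random placement makes it more likely that *some* block is lost, which
is part of why the answer depends on the workload.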

best,
Colin

On Tue, Aug 26, 2014 at 5:20 AM, Gary Malouf <malouf.gary@gmail.com> wrote:
> It appears support for this type of control over block placement is going
> out in the next version of HDFS:
> https://issues.apache.org/jira/browse/HDFS-2576
>
>
> On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf <malouf.gary@gmail.com> wrote:
>
>> One of my colleagues has been questioning me as to why Spark/HDFS makes no
>> attempt to co-locate related data blocks.  He pointed to this
>> paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
>> CoHadoop research and the performance improvements it yielded for
>> Map/Reduce jobs.
>>
>> Would leveraging these ideas for writing data from Spark make sense/be
>> worthwhile?