spark-dev mailing list archives

From Gary Malouf <>
Subject Re: CoHadoop Papers
Date Tue, 26 Aug 2014 15:37:25 GMT
Christopher, can you expand on the co-partitioning support?

We have a number of Spark SQL tables (saved in Parquet format) that could
all be considered to share a common hash key.  Our analytics team wants to
do frequent joins across these different datasets based on this key.  It
makes sense that if the data for each key were co-located on the same
server across 'tables', shuffles could be minimized and performance could
ultimately be much better.
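To make the intuition concrete, here is a minimal sketch in plain Python (not Spark, and not the CoHadoop mechanism itself) of why co-partitioning helps: if both tables are hash-partitioned on the join key with the same partitioner, matching keys land in the same partition, so the join can proceed partition-by-partition with no cross-partition shuffle. All names and the partition count here are illustrative assumptions.

```python
# Illustrative sketch only: shows the co-partitioned join idea,
# not Spark's or CoHadoop's actual implementation.

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_of(key, n=NUM_PARTITIONS):
    """The same deterministic hash partitioner, applied to both tables."""
    return hash(key) % n


def hash_partition(rows, n=NUM_PARTITIONS):
    """Split (key, value) rows into n partitions by join key."""
    parts = [[] for _ in range(n)]
    for key, value in rows:
        parts[partition_of(key, n)].append((key, value))
    return parts


def local_join(parts_a, parts_b):
    """Join partition i of A only with partition i of B.

    Because both sides used the same partitioner, no key can match
    across partitions, so no data movement (shuffle) is required.
    """
    out = []
    for pa, pb in zip(parts_a, parts_b):
        index = {}
        for k, v in pa:
            index.setdefault(k, []).append(v)
        for k, v in pb:
            for va in index.get(k, []):
                out.append((k, va, v))
    return out


# Hypothetical example data sharing a common hash key.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
events = [(1, "click"), (3, "view"), (1, "buy")]
joined = local_join(hash_partition(users), hash_partition(events))
```

In Spark terms, the analogous idea is partitioning two RDDs with the same Partitioner so a join becomes a narrow dependency; the open question in this thread is pushing the same co-location guarantee down into HDFS block placement so it survives writes to disk.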

From reading the HDFS issue I posted before, the way is being paved for
implementing this type of behavior, though I believe there are a lot of
complications to making it work.

On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen <> wrote:

> Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?
> If the former, Spark does support copartitioning.
> If the latter, it's an HDFS scope that's outside of Spark. On that note,
> Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm
> sure the paper makes useful contributions for its set of use cases.
> Sent while mobile. Pls excuse typos etc.
> On Aug 26, 2014 5:21 AM, "Gary Malouf" <> wrote:
>> It appears support for this type of control over block placement is going
>> out in the next version of HDFS:
>> On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf <>
>> wrote:
>> > One of my colleagues has been questioning me as to why Spark/HDFS
>> > makes no attempt to co-locate related data blocks.  He pointed to
>> > this paper: from 2011 on the CoHadoop research and the performance
>> > improvements it yielded for Map/Reduce jobs.
>> >
>> > Would leveraging these ideas for writing data from Spark make sense/be
>> > worthwhile?
