spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Nastetsky <alex.nastet...@vervemobile.com>
Subject Re: Sort Merge Join from the filesystem
Date Mon, 09 Nov 2015 19:02:47 GMT
Thanks for creating that ticket.

Another thing I was thinking of, is doing this type of join between dataset
A which is already partitioned/sorted on disk and dataset B, which gets
generated during the run of the application.

Dataset B would need something like repartitionAndSortWithinPartitions to
be performed on it, using the same partitioner that was used with dataset
A. Then dataset B could be joined with dataset A without needing to write
it to disk first (unless it's too big to fit in memory, then it would need
to be [partially] spilled).

On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao <hao.cheng@intel.com> wrote:

> Yes, we probably need more change for the data source API if we need to
> implement it in a generic way.
>
> BTW, I create the JIRA by copy most of words from Alex. J
>
>
>
> https://issues.apache.org/jira/browse/SPARK-11512
>
>
>
>
>
> *From:* Reynold Xin [mailto:rxin@databricks.com]
> *Sent:* Thursday, November 5, 2015 1:36 AM
> *To:* Alex Nastetsky
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Sort Merge Join from the filesystem
>
>
>
> It's not supported yet, and not sure if there is a ticket for it. I don't
> think there is anything fundamentally hard here either.
>
>
>
>
>
> On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <
> alex.nastetsky@vervemobile.com> wrote:
>
> (this is kind of a cross-post from the user list)
>
>
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same with the same number of
> partitions and sorted within each partition, without needing to
> repartition/sort them again?
>
>
>
> This functionality exists in
>
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
>
> - Pig (USING 'merge')
>
> - MapReduce (CompositeInputFormat)
>
>
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
>
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
>
>
> Thanks.
>
>
>

Mime
View raw message