spark-dev mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: Spark join over sorted columns of dataset.
Date Fri, 03 Mar 2017 16:23:11 GMT
For RDDs the shuffle is already skipped, but the sort is not. In spark-sorted
we track both the partitioning and the sort order within partitions for
key-value RDDs, so the sort can be avoided as well. See:
https://github.com/tresata/spark-sorted
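
To illustrate why knowing the sort order lets the sort be skipped, here is a minimal plain-Python sketch of the merge step of a sort-merge join over two inputs that are already sorted by key. It is not Spark code and not the spark-sorted API; the names `merge_join` and `_next_group` are hypothetical, and the inputs are assumed to be (key, value) pairs sorted by key within a single partition.

```python
from itertools import groupby
from operator import itemgetter


def _next_group(groups):
    """Return (key, [values]) for the next run of equal keys, or None at the end."""
    for key, rows in groups:
        # Materialize the run now: groupby invalidates it once we advance.
        return key, [value for _, value in rows]
    return None


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs that are ALREADY sorted by key.

    Because both sides are pre-sorted, no sort is performed here: the two
    inputs are simply walked in step, which is the merge phase on its own.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    l, r = _next_group(left_groups), _next_group(right_groups)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = _next_group(left_groups)      # left key has no match yet, advance left
        elif l[0] > r[0]:
            r = _next_group(right_groups)     # right key has no match yet, advance right
        else:
            # Same key on both sides: emit every left/right combination.
            for lv in l[1]:
                for rv in r[1]:
                    yield l[0], (lv, rv)
            l, r = _next_group(left_groups), _next_group(right_groups)


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
    for row in merge_join(left, right):
        print(row)
```

Each input is read once from start to end, so the cost is linear in the input sizes plus the size of the output. If either side were not actually sorted, this sketch would silently miss matches, which is why the sort order has to be tracked rather than assumed.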

For Dataset/DataFrame such optimizations are applied automatically; however,
this currently does not always work for Dataset. See:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-19468

On Mar 3, 2017 11:06 AM, "Rohit Verma" <rohit.verma@rokittech.com> wrote:

Sending this to the dev list.
Could you please help me with some ideas for the question below?

Regards
Rohit
> On Feb 23, 2017, at 3:47 PM, Rohit Verma <rohit.verma@rokittech.com> wrote:
>
> Hi
>
> When joining on two columns from different datasets, how can the join be
> optimized if both columns are already sorted within their datasets, so that
> the sorting phase can be skipped when Spark does a sort-merge join?
>
> Regards
> Rohit
