spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cheng, Hao" <>
Subject RE: Saprk 1.5 - How to join 3 RDDs in a SQL DF?
Date Mon, 12 Oct 2015 01:31:29 GMT
Thank you Ted, that’s very informative; from the DB optimization point of view, the Cost
Base join re-ordering, and the multi-way joins does provide better performance;

But from the API design point of view, 2 arguments (relation) for JOIN in the DF API probably
be enough for the multiple tables join cases, as we can always use the nested 2 way joins
to represents the multi-joins.
For example: A join B join C join D (multi-way join)=>
((A join B) join C) join D
 (A join (B join C)) join D
(A join B) join (C join D) etc.
That also means, we leave the optimization work for Spark SQL, not by users, and we believe
Spark SQL can do most of the dirty work for us.

However, sometimes, people do write an optimal SQL (e.g. with better join ordering) than the
Spark SQL optimizer does, then we’d better resort to the API SqlContext.sql(“”).


From: Ted Yu []
Sent: Monday, October 12, 2015 8:37 AM
To: Cheng, Hao
Cc: Richard Eggert; Subhajit Purkayastha; User
Subject: Re: Saprk 1.5 - How to join 3 RDDs in a SQL DF?

Some weekend reading:


On Sun, Oct 11, 2015 at 5:32 PM, Cheng, Hao <<>>
A join B join C === (A join B) join C
Semantically they are equivalent, right?

From: Richard Eggert [<>]
Sent: Monday, October 12, 2015 5:12 AM
To: Subhajit Purkayastha
Cc: User
Subject: Re: Saprk 1.5 - How to join 3 RDDs in a SQL DF?

It's the same as joining 2. Join two together, and then join the third one to the result of
On Oct 11, 2015 2:57 PM, "Subhajit Purkayastha" <<>>
Can I join 3 different RDDs together in a Spark SQL DF? I can find examples for 2 RDDs but
not 3.


View raw message