spark-user mailing list archives

From Younes Naguib <Younes.Nag...@tritondigital.com>
Subject RE: Subquery performance
Date Fri, 18 Mar 2016 02:27:06 GMT
Is there any way to cache the subquery or force a broadcast join without persisting it?

y

From: Michael Armbrust [mailto:michael@databricks.com]
Sent: March-17-16 8:59 PM
To: Younes Naguib
Cc: user@spark.apache.org
Subject: Re: Subquery performance

Try running EXPLAIN on both versions of the query.

Likely, when you cache the subquery, we know that it's going to be small, so we use a broadcast join
instead of shuffling the data.
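
A minimal sketch of that comparison in Spark SQL, using the query from the original message below. The subquery alias `t2`, the column alias `cnt`, and the cached table name `tab2_counts` are hypothetical names added for illustration; they do not appear in the original query.

```sql
-- Version 1: inline subquery. Inspect the physical plan for the join strategy.
EXPLAIN
SELECT col1, count(1)
FROM (SELECT col2, count(1) AS cnt FROM tab2 GROUP BY col2) t2
INNER JOIN tab1 ON (col1 = col2)
GROUP BY col1;

-- Version 2: cache the subquery result under a hypothetical name,
-- then join against the cached table and inspect the plan again.
CACHE TABLE tab2_counts AS
SELECT col2, count(1) AS cnt FROM tab2 GROUP BY col2;

EXPLAIN
SELECT col1, count(1)
FROM tab2_counts
INNER JOIN tab1 ON (col1 = col2)
GROUP BY col1;
```

If the explanation above holds, the first plan should show a shuffle-based join operator and the second a broadcast join operator. The exact operator names depend on the Spark version.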

On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib <Younes.Naguib@tritondigital.com> wrote:
Hi all,

I’m running a query that looks like the following:
SELECT col1, count(1)
FROM (SELECT col2, count(1) FROM tab2 GROUP BY col2) t2
INNER JOIN tab1 ON (col1 = col2)
GROUP BY col1

This creates a very large shuffle, 10 times the data size, as if the subquery were executed
for each row.
Is there anything that can be done to tune this?
When the subquery is persisted, it runs much faster, and the shuffle is 50 times smaller!

Thanks,
Younes
