spark-dev mailing list archives

From gsvic <victora...@gmail.com>
Subject RE: ShuffledHashJoin Possible Issue
Date Mon, 19 Oct 2015 09:59:51 GMT
Hi Hao,

Each table is created with the following Python code snippet:

import json
from math import ceil
from random import random

data = [{'id': 'A%d'%i, 'value':ceil(random()*10)} for i in range(0,50)]
with open('A.json', 'w+') as output:
    json.dump(data, output)

Tables A and B contain 10 and 50 tuples, respectively.
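
For reference, a minimal sketch of how both files could be generated with the sizes mentioned above. The helper name make_table, the file name B.json, and the choice to give B the same 'A%d' id scheme as A (so that the ids overlap) are assumptions for illustration, not taken from the original setup.

```python
import json
from math import ceil
from random import random


def make_table(prefix, n):
    # Hypothetical helper: n rows with ids like 'A0', 'A1', ... and a
    # random value computed the same way as in the snippet above.
    return [{'id': '%s%d' % (prefix, i), 'value': ceil(random() * 10)}
            for i in range(n)]


table_a = make_table('A', 10)  # 10 tuples
table_b = make_table('A', 50)  # 50 tuples, same id scheme (assumption)

if __name__ == '__main__':
    # Write each table as a single JSON array, as the original snippet does.
    for path, rows in (('A.json', table_a), ('B.json', table_b)):
        with open(path, 'w') as output:
            json.dump(rows, output)
```

With this layout every id in A also appears in B, so an equi-join on id should match exactly 10 rows.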

In the Spark shell I run

sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false") to disable
SortMergeJoin, and
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "0") to disable
BroadcastHashJoin, because the tables are small enough that this join would
otherwise be selected.

Finally I run the following query:
t1.join(t2).where(t1("id").equalTo(t2("id"))).count

and the result I get is zero, while BroadcastHashJoin and
SortMergeJoin return the correct result (10).
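
As a sanity check on the expected count, here is a minimal pure-Python sketch of a hash equi-join on id (build a hash table on one side, probe it with the other). It is not Spark's implementation; the function name hash_join_count and the in-memory tables t1 and t2 are hypothetical, and it assumes both tables use the same 'A%d' id scheme.

```python
from collections import defaultdict


def hash_join_count(build_rows, probe_rows, key='id'):
    # Build phase: count how many build-side rows carry each join key.
    buckets = defaultdict(int)
    for row in build_rows:
        buckets[row[key]] += 1
    # Probe phase: each probe row matches once per build row with the same key.
    return sum(buckets.get(row[key], 0) for row in probe_rows)


# Stand-ins for the two tables: 10 and 50 rows with overlapping ids (assumption).
t1 = [{'id': 'A%d' % i, 'value': i} for i in range(10)]
t2 = [{'id': 'A%d' % i, 'value': i} for i in range(50)]

print(hash_join_count(t1, t2))  # prints 10
```

Under these assumptions, any correct equi-join strategy should report 10 for the count, independent of which physical join operator is selected.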



