spark-issues mailing list archives

From "Miguel Tormo (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-17211) Broadcast join produces incorrect results when compressed Oops differs between driver, executor
Date Sat, 03 Sep 2016 21:17:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15459925#comment-15459925 ]

Miguel Tormo edited comment on SPARK-17211 at 9/3/16 9:16 PM:
--------------------------------------------------------------

[~gurmukhd], you say it works only for heap < 32 GB. Which heap do you mean, driver memory
or executor memory? You only need to disable compressed oops for the heap that is under 32
GB when the other one isn't. To be on the safe side, you can test with it disabled on both:

spark-shell ...  --driver-java-options "-XX:-UseCompressedOops" --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
...
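
As a rough way to tell whether a given setup is exposed to this, the sketch below compares driver and executor memory settings. It assumes the JVM enables compressed oops by default only for heaps below about 32 GB; the exact cutoff depends on the JVM and its settings. The helper names (parse_memory, uses_compressed_oops, oops_settings_differ) are hypothetical and not part of Spark, and the parser only handles simple sizes such as "40g" or "512m".

```python
import re

# Assumption: compressed oops are on by default only when the maximum heap is
# below roughly 32 GiB. The real cutoff is JVM-specific, so treat it as approximate.
COMPRESSED_OOPS_LIMIT_BYTES = 32 * 1024 ** 3

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_memory(size):
    """Convert a simple memory string such as '40g' or '512m' to bytes."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", size.strip().lower())
    if match is None:
        raise ValueError("unrecognized memory size: %r" % (size,))
    return int(match.group(1)) * _UNITS[match.group(2)]


def uses_compressed_oops(memory):
    """Guess whether a JVM with this heap size uses compressed oops by default."""
    return parse_memory(memory) < COMPRESSED_OOPS_LIMIT_BYTES


def oops_settings_differ(driver_memory, executor_memory):
    """True when driver and executor would likely end up with different settings."""
    return uses_compressed_oops(driver_memory) != uses_compressed_oops(executor_memory)
```

For instance, oops_settings_differ("4g", "40g") is true under this assumption, which is the kind of configuration where the flags above are worth trying.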

[~davies] I just tried the patch and it seems to work! I couldn't break it, although it
probably needs more thorough testing to cover other cases. But it does seem to fix the issue, thanks!




was (Author: migtor):
[~gurmukhd], you say it works only for heap < 32 GB, which one do you mean? Driver memory
or executors? There is only need to disable compressed oops for the heap that is over 32 GB
when the other isn't. You can test disabling it for both cases to be on the safe side:

spark-shell ...  --driver-java-options "-XX:-UseCompressedOops" --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
...

[~davies] I just tried the patch, it seems to work! I couldn't break it although probably
it needs more thorough testing to cover other cases. But it does seem to fix the issue, thanks!



> Broadcast join produces incorrect results when compressed Oops differs between driver, executor
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17211
>                 URL: https://issues.apache.org/jira/browse/SPARK-17211
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Jarno Seppanen
>            Assignee: Davies Liu
>
> Broadcast join produces incorrect columns in the join result; see below for an example. The
> same join without broadcast gives the correct columns.
> Running PySpark on YARN on Amazon EMR 5.0.0.
> {noformat}
> import pyspark.sql.functions as func
> keys = [
>     (54000000, 0),
>     (54000001, 1),
>     (54000002, 2),
> ]
> keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
> keys_df.show()
> # +--------+-----+
> # |  key_id|value|
> # +--------+-----+
> # |54000000|    0|
> # |54000001|    1|
> # |54000002|    2|
> # +--------+-----+
> data = [
>     (54000002,    1),
>     (54000000,    2),
>     (54000001,    3),
> ]
> data_df = spark.createDataFrame(data, ['key_id', 'foo'])
> data_df.show()
> # +--------+---+
> # |  key_id|foo|
> # +--------+---+
> # |54000002|  1|
> # |54000000|  2|
> # |54000001|  3|
> # +--------+---+
> ### INCORRECT ###
> data_df.join(func.broadcast(keys_df), 'key_id').show()
> # +--------+---+--------+
> # |  key_id|foo|   value|
> # +--------+---+--------+
> # |54000002|  1|54000002|
> # |54000000|  2|54000000|
> # |54000001|  3|54000001|
> # +--------+---+--------+
> ### CORRECT ###
> data_df.join(keys_df, 'key_id').show()
> # +--------+---+-----+
> # |  key_id|foo|value|
> # +--------+---+-----+
> # |54000000|  2|    0|
> # |54000001|  3|    1|
> # |54000002|  1|    2|
> # +--------+---+-----+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

