spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size
Date Thu, 08 Oct 2015 18:13:28 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949079#comment-14949079 ]

Reynold Xin commented on SPARK-10914:
-------------------------------------

OK I figured it out. Updated the description.


> UnsafeRow serialization breaks when two machines have different Oops size
> -------------------------------------------------------------------------
>
>                 Key: SPARK-10914
>                 URL: https://issues.apache.org/jira/browse/SPARK-10914
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>         Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>            Reporter: Ben Moran
>
> *Updated description (by rxin on Oct 8, 2015)*
> To reproduce, launch Spark using
> {code}
> MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
> {code}
> And then run the following
> {code}
> scala> sql("select 1 xx").collect()
> {code}
> The problem is that an UnsafeRow pointing to data in memory carries 3 pieces of information: a base object, a base offset, and a length. When the row is serialized with Java/Kryo serialization, the base offset is no longer valid if the two machines have different pointer widths (ordinary object pointers, or "Oops", in the JVM), because the in-memory object layout differs between them.
> *Original bug report description*:
> Using an inner join to match two integer columns, I generally get no results when there should be matches. The results vary and depend on whether the dataframes come from SQL, JSON, or the cache, as well as the order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on fresh installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6, from http://spark.apache.org/downloads.html).
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy":2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}
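The pointer-width failure mode in the updated description can be sketched outside Spark. The sketch below is purely illustrative (the `OopsDemo` name and the header sizes are hypothetical, not Spark internals): a base offset computed against one JVM's object layout misaddresses the data inside a backing array laid out by a JVM with a different Oops size, whereas serializing the row's payload bytes themselves avoids the problem.

```scala
object OopsDemo {
  // Illustrative only: pretend an array's object header is 16 bytes with
  // compressed oops and 24 bytes without (real sizes depend on the JVM).
  def alloc(header: Int, payload: Array[Byte]): Array[Byte] =
    Array.fill[Byte](header)(0) ++ payload

  def main(args: Array[String]): Unit = {
    val payload = Array[Byte](1, 2, 3, 4)

    // Writer JVM uses compressed oops: the row's data starts at offset 16,
    // and that raw offset gets serialized along with the base object.
    val writerOffset = 16

    // Reader JVM runs without compressed oops: its layout puts the same
    // payload at offset 24 inside the backing array.
    val readerBuf = alloc(24, payload)

    // Dereferencing the writer's offset against the reader's layout reads
    // the wrong bytes (here, header padding instead of the payload):
    val broken = readerBuf.slice(writerOffset, writerOffset + payload.length)
    println(broken.sameElements(payload))  // false

    // Copying the payload bytes themselves is layout-independent:
    val copied = readerBuf.slice(24, 24 + payload.length)
    println(copied.sameElements(payload))  // true
  }
}
```

This is why serializing the raw (object, offset, length) triple is only safe when both JVMs agree on object layout, e.g. when both run with the same `-XX:±UseCompressedOops` setting.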



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

