spark-issues mailing list archives

From "Stephen Boesch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4775) Possible problem in a simple join? Getting duplicate rows and missing rows
Date Sat, 06 Dec 2014 18:44:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236919#comment-14236919 ]

Stephen Boesch commented on SPARK-4775:
---------------------------------------

Two small tweaks have been made to the testing class.  I have now created a GitHub branch
for this on the Huawei repo:


https://github.com/Huawei-Spark/spark/blob/SPARKSQL-4775/sql/core/src/test/scala/org/apache/spark/sql/SparkSQLJoinSuite.scala

This test may be run as follows:

    (setup):  mvn -Pyarn -Phadoop-2.3  install compile package -DskipTests
    (run test): mvn -pl sql/core -Pyarn -Phadoop-2.3 -DwildcardSuites=org.apache.spark.sql.SparkSQLJoinSuite test

results/output:


Run starting. Expected test count is: 1
SparkSQLJoinSuite:
2014-12-06 10:42:36.174 java[22327:958089] Unable to load realm info from SCDynamicStore
10:42:41.370 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Table1 Contents:
[1,valA1]
[2,valA2]
Table2 Contents:
[1,valB1]
[1,valB2]
[2,valB3]
[2,valB4]
select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
                t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
                    SparkJoinTable2 t2 on t1.intcol = t2.intcol
select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
                t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
                    SparkJoinTable2 t2 on t1.intcol = t2.intcol came back with 4 results
Results
[1,1,valA1,valB2]
[1,1,valA1,valB2]
[2,2,valA2,valB4]
[2,2,valA2,valB4]
ERROR: Row0 failed: Mismatch- act=valB2 exp=valB1
ERROR: Row2 failed: Mismatch- act=valB4 exp=valB3
- Basic Join on vanilla SparkSql: Simple Two Way  2 cols *** FAILED ***
  One or more rows did not match expected (SparkSQLJoinSuite.scala:81)
Run completed in 23 seconds, 258 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
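
The two ERROR lines above are consistent with comparing each returned row against the expected row at the same position. A minimal plain-Scala sketch of such a check follows, with no Spark involved. The names RowCheck and mismatches are hypothetical, and the actual logic in SparkSQLJoinSuite.scala may differ.

```scala
// Sketch only: positional comparison of actual vs. expected join rows.
// RowCheck and mismatches are illustrative names, not taken from the suite.
object RowCheck {
  type Row = Seq[Any]

  // Expected rows, as listed in the issue description.
  val expected: Seq[Row] = Seq(
    Seq(1, 1, "valA1", "valB1"),
    Seq(1, 1, "valA1", "valB2"),
    Seq(2, 2, "valA2", "valB3"),
    Seq(2, 2, "valA2", "valB4"))

  // Rows actually returned by the join in the run above.
  val actual: Seq[Row] = Seq(
    Seq(1, 1, "valA1", "valB2"),
    Seq(1, 1, "valA1", "valB2"),
    Seq(2, 2, "valA2", "valB4"),
    Seq(2, 2, "valA2", "valB4"))

  // One message per differing cell, tagged with the row index.
  def mismatches(actual: Seq[Row], expected: Seq[Row]): Seq[String] =
    actual.zip(expected).zipWithIndex.flatMap { case ((act, exp), rowIx) =>
      act.zip(exp).collect {
        case (a, e) if a != e => s"Row$rowIx failed: Mismatch- act=$a exp=$e"
      }
    }

  def main(args: Array[String]): Unit =
    mismatches(actual, expected).foreach(msg => println(s"ERROR: $msg"))
}
```

With the data above, only rows 0 and 2 differ from the expected rows, which matches the two errors reported in the log.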




> Possible problem in a simple join?  Getting duplicate rows and missing rows
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-4775
>                 URL: https://issues.apache.org/jira/browse/SPARK-4775
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: Run on Mac but should be agnostic
>            Reporter: Stephen Boesch
>
> I am working on testing of HBase joins. As part of this work some simple vanilla SparkSQL tests were created.  Some of the results are surprising: here are the details:
> ------------------------------------
> Consider the following schema that includes two columns:
> case class JoinTable2Cols(intcol: Int, strcol: String)
> Let us register two temp tables using this schema and insert 2 rows and 4 rows respectively:
>     val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, s"valA$ix")})
>     rdd1.registerTempTable("SparkJoinTable1")
>     val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
>     val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, s"valB$is")})
>     rdd2.registerTempTable("SparkJoinTable2")
> Here is the data in both tables:
> Table1 Contents:
> [1,valA1]
> [2,valA2]
> Table2 Contents:
> [1,valB1]
> [1,valB2]
> [2,valB3]
> [2,valB4]
> Now let us join the tables on the first column:
> select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
>                 t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
>                     SparkJoinTable2 t2 on t1.intcol = t2.intcol
> What results do we get:
>  came back with 4 results
> Results
> [1,1,valA1,valB2]
> [1,1,valA1,valB2]
> [2,2,valA2,valB4]
> [2,2,valA2,valB4]
> Huh??
> Where did valB1 and valB3 go? Why do we have duplicate rows?
> Note: the expected results were:
>       Seq(1, 1, "valA1", "valB1"),
>       Seq(1, 1, "valA1", "valB2"),
>       Seq(2, 2, "valA2", "valB3"),
>       Seq(2, 2, "valA2", "valB4"))
> A standalone testing program, SparkSQLJoinSuite, is attached. An abridged version of the actual output is also attached.
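
For reference, the expected result of the join can be reproduced without Spark by joining the same data with plain Scala collections. This is a minimal sketch, assuming the case class and data from the description above. The object name ExpectedJoin and the helper innerJoin are hypothetical and are not part of the attached suite.

```scala
// Sketch only: a nested-loop inner join over in-memory collections,
// showing what the SQL join on intcol is expected to return.
case class JoinTable2Cols(intcol: Int, strcol: String)

object ExpectedJoin {
  // Same contents as SparkJoinTable1 and SparkJoinTable2 in the description.
  val table1: Seq[JoinTable2Cols] =
    (1 to 2).map(ix => JoinTable2Cols(ix, s"valA$ix"))
  val table2: Seq[JoinTable2Cols] =
    Seq((1, 1), (1, 2), (2, 3), (2, 4)).map { case (ix, is) =>
      JoinTable2Cols(ix, s"valB$is")
    }

  // One output row per (t1, t2) pair whose intcol values are equal.
  def innerJoin(
      left: Seq[JoinTable2Cols],
      right: Seq[JoinTable2Cols]): Seq[(Int, Int, String, String)] =
    for {
      t1 <- left
      t2 <- right
      if t1.intcol == t2.intcol
    } yield (t1.intcol, t2.intcol, t1.strcol, t2.strcol)

  def main(args: Array[String]): Unit =
    innerJoin(table1, table2).foreach(println)
}
```

Each of the four rows in the second table matches exactly one row in the first, so the join should yield four distinct rows, one per value valB1 through valB4.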



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
