spark-issues mailing list archives

From "Zoltan Fedor (JIRA)" <>
Subject [jira] [Created] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()
Date Tue, 26 Jul 2016 20:16:20 GMT
Zoltan Fedor created SPARK-16741:

             Summary: spark.speculation causes duplicate rows in df.write.jdbc()
                 Key: SPARK-16741
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.2
         Environment: PySpark 1.6.2, Oracle Linux 6.5, Oracle 11.2
            Reporter: Zoltan Fedor

Thanks to a fix included in Spark 1.6.2, we can now write string data back into an Oracle
database, so I tried it out and found that rows were duplicated in the database table after
they were inserted into our Oracle database.

The code we use is very simple:
df = sqlContext.sql("SELECT * FROM example_temp_table")
df.write.jdbc("jdbc:oracle:thin:"+connection_script, "target_table")

The data in the 'target_table' in the database has twice as many rows as the 'df' dataframe
in SparkSQL.

After some investigation, it turned out that this is caused by our spark.speculation setting
being set to True.
As soon as we turned it off, no more duplicates were generated.
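
For reference, the workaround described above can be sketched as follows. This is a minimal sketch, assuming the setting is applied through SparkConf when the context is created; the application name is hypothetical, and the same property can instead be set in spark-defaults.conf or passed with --conf on spark-submit.

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Explicitly disable speculative execution for this application.
conf = (
    SparkConf()
    .setAppName("jdbc-write-no-speculation")  # hypothetical application name
    .set("spark.speculation", "false")
)

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Check what the running context actually uses.
print(sc.getConf().get("spark.speculation"))
```

Disabling speculation avoids the duplicates in our case, but it also gives up the protection speculation offers against slow tasks, so it is a workaround rather than a fix.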

This somewhat makes sense: spark.speculation causes the map jobs to run in 2 copies, resulting
in every row being inserted into our Oracle database twice.
Probably the df.write.jdbc() method does not account for a Spark context running in speculative
mode, so the inserts coming from the speculative map also reach the database, and every
record ends up inserted twice.
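
The suspected mechanism can be modelled without Spark. The sketch below is an illustration only, not Spark's JDBC writer: it uses an in-memory SQLite database from the Python standard library, made-up rows, and hypothetical table names, and it assumes that both the original and the speculative attempt run to completion and commit. A plain INSERT doubles the row count, while a keyed table with an idempotent insert does not.

```python
import sqlite3

# One partition of the DataFrame (made-up data for illustration).
ROWS = [(1, "a"), (2, "b"), (3, "c")]


def write_partition(conn, insert_sql, rows):
    """Simulate one task attempt: insert every row of the partition and commit."""
    conn.executemany(insert_sql, rows)
    conn.commit()


def simulate(speculation):
    """Return (plain_count, keyed_count) after all attempts have written."""
    conn = sqlite3.connect(":memory:")
    # Table without a key: every insert is accepted.
    conn.execute("CREATE TABLE plain_target (id INTEGER, value TEXT)")
    # Table with a primary key: repeated ids can be detected.
    conn.execute("CREATE TABLE keyed_target (id INTEGER PRIMARY KEY, value TEXT)")

    # With speculation, assume a second copy of the same task also finishes.
    attempts = 2 if speculation else 1
    for _ in range(attempts):
        write_partition(conn, "INSERT INTO plain_target VALUES (?, ?)", ROWS)
        write_partition(conn, "INSERT OR IGNORE INTO keyed_target VALUES (?, ?)", ROWS)

    plain = conn.execute("SELECT COUNT(*) FROM plain_target").fetchone()[0]
    keyed = conn.execute("SELECT COUNT(*) FROM keyed_target").fetchone()[0]
    conn.close()
    return plain, keyed


print(simulate(speculation=False))
print(simulate(speculation=True))
```

INSERT OR IGNORE is SQLite syntax; on Oracle the equivalent idea would need a unique key on the target table and something like a MERGE statement, which is outside what df.write.jdbc() does in this report.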

This bug is likely independent of the database type (we use Oracle) and of whether PySpark,
Scala or Java is used.
