spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maciej Szymkiewicz <mszymkiew...@gmail.com>
Subject Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors
Date Mon, 21 Nov 2016 09:32:23 GMT
In commonly used RDBM systems relations have no fixed order and physical
location of the records can change during routine maintenance
operations. Unless you explicitly order data during retrieval order you
see is incidental and not guaranteed. 

Conclusion: order of inserts just doesn't matter.

On 11/21/2016 10:03 AM, Niranda Perera wrote:
> Hi, 
>
> Say, I have a table with 1 column and 1000 rows. I want to save the
> result in a RDBMS table using the jdbc relation provider. So I run the
> following query, 
>
> "insert into table table2 select value, count(*) from table1 group by
> value order by value"
>
> While debugging, I found that the resultant df from select value,
> count(*) from table1 group by value order by value would have around
> 200+ partitions and say I have 4 executors attached to my driver. So,
> I would have 200+ writing tasks assigned to 4 executors. I want to
> understand, how these executors are able to write the data to the
> underlying RDBMS table of table2 without messing up the order. 
>
> I checked the jdbc insertable relation and in jdbcUtils [1] it does
> the following
>
> df.foreachPartition { iterator =>
>       savePartition(getConnection, table, iterator, rddSchema,
> nullTypes, batchSize, dialect)
>     }
>
> So, my understanding is, all of my 4 executors will parallely run the
> savePartition function (or closure) where they do not know which one
> should write data before the other! 
>
> In the savePartition method, in the comment, it says 
> "Saves a partition of a DataFrame to the JDBC database.  This is done in
>    * a single database transaction in order to avoid repeatedly inserting
>    * data as much as possible."
>
> I want to understand, how these parallel executors save the partition
> without harming the order of the results? Is it by locking the
> database resource, from each executor (i.e. ex0 would first obtain a
> lock for the table and write the partition0, while ex1 ... ex3 would
> wait till the lock is released )? 
>
> In my experience, there is no harm done to the order of the results at
> the end of the day! 
>
> Would like to hear from you guys! :-) 
>
> [1] https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277
>
> -- 
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/

-- 
Best regards,
Maciej Szymkiewicz


Mime
View raw message