airflow-dev mailing list archives

From Nicolas Paris <nicolas.pa...@riseup.net>
Subject Re: Help SparkJDBCOperator
Date Sat, 09 Feb 2019 11:16:00 GMT
Hi

Be careful with Spark JDBC as a replacement for Sqoop on large tables.
Sqoop can handle a source table of any size, while the Spark JDBC design
cannot: although it provides a way to distribute the read across
multiple partitions, Spark is limited by executor memory, whereas Sqoop
is limited only by HDFS space.
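To make the partitioning point concrete, here is a minimal plain-Python sketch (no Spark needed) of how Spark's JDBC source turns `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` into one WHERE-clause predicate per partition. The column name `id` and the bounds are made-up examples; the behavior mirrors what Spark does internally, and each partition's rows are pulled by a single executor, which is why executor memory is the limit:

```python
# Sketch of Spark's JDBC range partitioning: split [lowerBound, upperBound)
# into numPartitions strides, each becoming a WHERE predicate for one task.
# Hypothetical example values; not Spark's actual source code.

def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one SQL predicate per partition, together covering all rows."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        if i == 0:
            # First partition also catches rows below lowerBound and NULLs.
            predicates.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition catches everything from the previous boundary up.
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates

for p in jdbc_partition_predicates("id", 0, 1000, 4):
    print(p)
```

Note that the bounds only shape the split, they do not filter rows, so a skewed or unbounded partition can still blow up one executor's memory.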

As a result, I have written a Spark library (for Postgres only right
now) which overcomes the core Spark JDBC limitations. It handles any
workload, and in my tests it was 8 times faster than Sqoop. I have not
tested it with Airflow, but it is compatible with Apache Livy and
pySpark.

https://github.com/EDS-APHP/spark-postgres


On Fri, Feb 01, 2019 at 01:53:57PM +0100, Iván Robla Albarrán wrote:
> Hi ,
> 
> I am searching for a way to substitute Apache Sqoop.
> 
> I am analyzing SparkJDBCOperator, but I don't understand how I have to use it.
> 
> Is it a version of the SparkSubmitOperator that includes a JDBC
> connection?
> 
> Do I need to include Spark code?
> 
> Any example?
> 
> Thanks, I am very lost
> 
> Regards,
> Iván Robla

-- 
nicolas
