airflow-dev mailing list archives

From "Driesprong, Fokko" <fo...@driesprong.frl>
Subject Re: Help SparkJDBCOperator
Date Sun, 10 Feb 2019 11:45:33 GMT
Looking good Nicolas, thanks for sharing.

Since there is also PySpark support, it should be relatively straightforward
to invoke the spark-postgres library from Airflow.
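One way to wire this up from Airflow is to have a task assemble and run a spark-submit command that puts the spark-postgres jar on the classpath. A minimal illustrative sketch follows; the job script name, jar path, and conf values are hypothetical placeholders, not anything from the library's documentation:

```python
# Hedged sketch: building a spark-submit argv list, as an Airflow task
# (e.g. a BashOperator or a SparkSubmitOperator hook) might do. The paths
# and the PySpark job name below are hypothetical.

def build_spark_submit_cmd(app, jars=None, conf=None):
    """Assemble a spark-submit command as a list of arguments."""
    cmd = ["spark-submit"]
    if jars:
        # extra jars (e.g. the spark-postgres jar) are comma-separated
        cmd += ["--jars", ",".join(jars)]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd

cmd = build_spark_submit_cmd(
    "load_postgres.py",                    # hypothetical PySpark job
    jars=["/opt/jars/spark-postgres.jar"], # hypothetical jar location
    conf={"spark.executor.memory": "4g"},
)
print(" ".join(cmd))
```

The resulting list can be passed straight to a subprocess call, which keeps the Airflow side free of any Spark code.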

Cheers, Fokko

On Sat, Feb 9, 2019 at 12:16, Nicolas Paris <nicolas.paris@riseup.net> wrote:

> Hi
>
> Be careful with Spark JDBC as a replacement for Sqoop on large tables.
> Sqoop can handle a source table of any size, while the Spark JDBC design
> cannot: although it provides a way to distribute the read across
> multiple partitions, Spark is limited by the executors' memory, whereas
> Sqoop is limited only by HDFS space.
>
> As a result, I have written a Spark library (for Postgres only right
> now) which overcomes the core Spark JDBC limitations. It handles any
> workload, and in my tests it was 8 times faster than Sqoop. I have not
> tested it with Airflow, but it is compatible with Apache Livy and
> PySpark.
>
> https://github.com/EDS-APHP/spark-postgres
>
>
> On Fri, Feb 01, 2019 at 01:53:57PM +0100, Iván Robla Albarrán wrote:
> > Hi ,
> >
> > I am searching for a way to replace Apache Sqoop.
> >
> > I am analyzing SparkJDBCOperator, but I don't understand how to use it.
> >
> > Is it a version of the SparkSubmitOperator that adds a JDBC
> > connection?
> >
> > Do I need to include Spark code?
> >
> > Any example?
> >
> > Thanks, I am very lost.
> >
> > Regards,
> > Iván Robla
>
> --
> nicolas
>
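Nicolas's caution about executor memory comes from how Spark's JDBC source splits a read: it divides the numeric partition column's range into numPartitions strides and issues one WHERE clause per partition, each read by a single executor. The sketch below is plain Python approximating that splitting logic for illustration, not Spark's actual source:

```python
# Illustrative sketch of how a Spark-JDBC-style read partitions a numeric
# column into per-executor WHERE clauses. Each clause is fetched by one
# executor, so an oversized or skewed partition must fit in that executor's
# memory -- the limitation discussed above. Sqoop, by contrast, streams
# each split directly to HDFS.

def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE clause per partition, covering the whole key range."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower_bound + i * stride
        hi = lower_bound + (i + 1) * stride
        if i == 0:
            # first partition also picks up NULLs and values below lower_bound
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last partition is open-ended above upper_bound
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

for pred in jdbc_partition_predicates("id", 0, 1000, 4):
    print(pred)
```

If the key distribution is skewed, one of these ranges can hold most of the rows, and that single partition must still fit in one executor's memory regardless of how many partitions were requested.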
