airflow-dev mailing list archives

From Iván Robla Albarrán <ivanro...@gmail.com>
Subject Re: Help SparkJDBCOperator
Date Mon, 11 Feb 2019 09:04:42 GMT
Thanks for sharing.

I will look into your Postgres solution.

Thanks!

Regards,
Iván


On Sun, 10 Feb 2019 at 12:45, Driesprong, Fokko (<fokko@driesprong.frl>)
wrote:

> Looking good Nicolas, thanks for sharing.
>
> Since there is also PySpark support, it should be relatively straightforward
> to invoke the spark-postgres library from Airflow.
>
> Cheers, Fokko
>
> On Sat, 9 Feb 2019 at 12:16, Nicolas Paris <nicolas.paris@riseup.net>
> wrote:
>
> > Hi
> >
> > Be careful with SparkJDBC as a replacement for Sqoop on large tables.
> > Sqoop can handle a source table of any size, while the SparkJDBC design
> > cannot.
> > Although it provides a way to distribute the read across multiple
> > partitions, Spark is limited by executor memory, whereas Sqoop is
> > limited by HDFS space.
> >
> > As a result, I have written a Spark library (for Postgres only right
> > now) which overcomes the core Spark JDBC limitations. It handles any
> > workload, and in my tests it was 8 times faster than Sqoop. I have not
> > tested it with Airflow, but it is compatible with Apache Livy and
> > PySpark.
> >
> > https://github.com/EDS-APHP/spark-postgres
> >
> >
> > On Fri, Feb 01, 2019 at 01:53:57PM +0100, Iván Robla Albarrán wrote:
> > > Hi,
> > >
> > > I am searching for a way to replace Apache Sqoop.
> > >
> > > I am analyzing SparkJDBCOperator, but I don't understand how I am
> > > supposed to use it.
> > >
> > > Is it a version of the SparkSubmit operator that takes a JDBC
> > > connection as its connection?
> > >
> > > Do I need to include Spark code?
> > >
> > > Is there any example?
> > >
> > > Thanks, I am very lost.
> > >
> > > Regards,
> > > Iván Robla
> >
> > --
> > nicolas
> >
>
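
Editor's sketch, not part of the original thread: Nicolas mentions that Spark JDBC "provides a way to distribute in multiple partitions". The plain-Python snippet below approximates how a numeric column range can be split into one WHERE predicate per partition, in the spirit of Spark's partitionColumn / lowerBound / upperBound / numPartitions read options. The function name jdbc_partition_predicates and the stride arithmetic are illustrative assumptions, not Spark's actual code; it needs neither Spark nor Airflow to run.

```python
from typing import List, Optional


def jdbc_partition_predicates(
    column: str, lower_bound: int, upper_bound: int, num_partitions: int
) -> List[Optional[str]]:
    """Return one WHERE predicate per partition (None means 'no filter').

    Hypothetical helper: a simplified approximation of a range-partitioned
    JDBC read, not a copy of Spark's implementation.
    """
    if num_partitions <= 1 or lower_bound >= upper_bound:
        # A single partition reads the whole table in one query.
        return [None]

    # Width of each range; at least 1 so the bounds always advance.
    stride = max(1, (upper_bound - lower_bound) // num_partitions)

    predicates: List[Optional[str]] = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit,
        # so rows outside [lower_bound, upper_bound) are still read.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None

        if lower and upper:
            predicates.append(f"{lower} AND {upper}")
        elif upper:
            # NULL keys are assigned to the first partition.
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(lower)
    return predicates


if __name__ == "__main__":
    for predicate in jdbc_partition_predicates("id", 0, 100, 4):
        print(predicate)
```

In this sketch the bounds only decide where the ranges are cut; they do not filter rows. A skewed key column can therefore still push most of the table into a single partition, which is one way a read could run into the executor memory limit described in the thread.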
