nifi-dev mailing list archives

From Mike Thomsen <mikerthom...@gmail.com>
Subject Re: NiFi for Easy and Efficient Data Ingestion from Different Sources into Delta Lake
Date Tue, 07 Apr 2020 16:40:31 GMT
Martin,

Please take a look at this thread as it will answer a lot of the questions
about NiFi integration:

http://apache-nifi-users-list.2361937.n4.nabble.com/How-to-use-delta-storage-format-td9259.html

Pierre's point about how it works comes out of that discussion. The current
connector approach of using a standalone Spark job is simply not scalable
for a real data lake. The thing that we keep coming back to here is that
you cannot get away from having Spark do the heavy lifting of integrating
and reconciling everything. This makes a lot of sense because, ultimately,
Spark is also the analytics engine that will do the work of analyzing the
data set to answer users' queries.

Realistically, there is nothing stopping you from integrating NiFi and
Delta Lake right now with off-the-shelf NiFi capabilities: use the built-in
Parquet features, push the Parquet data up to where Spark will read it, and
then set up a means to fire off Spark at an interval that meets your users'
needs.

It's also worth noting that right now the Hive support is very immature, as
it apparently is read-only. So the best guidance we can give you right now
is to just roll out a flow that converts data to Parquet and uses a simple
Spark job to integrate the Parquet data into Delta Lake.
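
As a rough illustration of that last step, a minimal PySpark sketch might
look like the following. It assumes the Delta Lake package is already on the
Spark classpath (for example, supplied when the job is submitted), and the
function name, app name, and paths are hypothetical placeholders rather than
anything NiFi or Delta Lake prescribes.

```python
import sys


def append_parquet_to_delta(spark, landing_path, delta_path):
    """Append the Parquet files under landing_path to the Delta table at delta_path."""
    # Read whatever Parquet files the NiFi flow has landed so far.
    df = spark.read.parquet(landing_path)
    # Write through the Delta format in append mode so existing table data is kept.
    df.write.format("delta").mode("append").save(delta_path)
    return df


# Runs only when both paths are given on the command line (hypothetical usage):
#   spark-submit parquet_to_delta.py <landing_path> <delta_path>
if __name__ == "__main__" and len(sys.argv) == 3:
    # Imported here so the function above can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-to-delta").getOrCreate()
    append_parquet_to_delta(spark, sys.argv[1], sys.argv[2])
    spark.stop()
```

Note that this sketch re-reads everything under the landing path on each
run, so the flow (or the job) would need to move or clear processed files
between runs to avoid appending the same rows twice.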

Thanks,

Mike

On Tue, Apr 7, 2020 at 4:45 AM Pierre Villard <pierre.villard.fr@gmail.com>
wrote:

> Hi Martin,
>
> As far as I understood the recent discussions on this matter, it would
> require to run a Spark standalone job in the NiFi JVM to actually write
> data in Delta Lake. Mike may have better ideas but, to me, it sounds like a
> tedious process.
>
> Thanks,
> Pierre
>
> On Mon, Apr 6, 2020 at 21:40, Martin Ebert <martinebert1989@gmail.com>
> wrote:
>
> > Hi Joe,
> > I don't know if this is the right channel for this, but I've been talking
> > to the Tech Lead for Delta at Databricks: "I’m very pro any other OSS
> > projects that want to integrate with Delta. Feel free to put those
> > committers in touch with me if they are looking to build integrations."
> >
> > A general Delta integration would be a great gain for the NiFi community
> > if someone took up the exchange here. As far as I could follow this, Mike
> > Thomsen has dealt with Delta a bit.
> >
> > Best,
> > Martin
> >
> > Joe Witt <joe.witt@gmail.com> wrote on Fri, Mar 27, 2020, 21:42:
> >
> > > Martin,
> > >
> > > If I follow that blog post correctly, Databricks (a vendor), in
> > > partnership with several other vendors, created a set of integrations
> > > to feed data into Delta Lake.  It might make sense then for Databricks
> > > to partner with the appropriate vendor around Apache NiFi to have that
> > > same kind of engagement/collaboration/etc.
> > >
> > > Here though - in the Apache NiFi community it is just simply about the
> > > community establishing some JIRAs and what it wants to do as it relates
> > > to feeding data from Apache NiFi into Delta Lake and knocking it out.
> > > You seem to have been learning a lot about NiFi for the past couple
> > > months.  And I suspect you already know a good bit about Delta Lake.
> > > You're perhaps in the best position to help identify some things you
> > > think we should do.  If that is right I'm looking forward to hearing
> > > some ideas you have.  There is clearly something we can do here - or
> > > maybe already it works...  But let's find out.  You'll likely have to
> > > help lead/push/advocate for that but with some momentum I think you'll
> > > find folks help.
> > >
> > > Thanks
> > >
> > > On Fri, Mar 27, 2020 at 3:38 PM Martin Ebert <martinebert1989@gmail.com>
> > > wrote:
> > >
> > > > Hi community,
> > > > are there any plans to adopt this auto loader functionality:
> > > >
> > > > https://databricks.com/blog/2020/02/24/introducing-databricks-ingest-easy-data-ingestion-into-delta-lake.html
> > > >
> > > > What are the plans for NiFi 2020?
> > > >
> > > > Best,
> > > > Martin
> > > >
> > >
> >
>
