nifi-users mailing list archives

From "Carlos Manuel Fernandes (DSI)" <carlos.antonio.fernan...@cgd.pt>
Subject RE: ELT on Nifi
Date Fri, 07 Oct 2016 17:29:02 GMT
Andy,

Good suggestion, I will do that. I have created several ExecuteScript processors (in Groovy) before.

Thanks

Carlos





From: Andy LoPresto [mailto:alopresto@apache.org]
Sent: Friday, October 7, 2016 18:21
To: users@nifi.apache.org
Subject: Re: ELT on Nifi

Carlos,

If you are comfortable with Groovy I would suggest you look at using the ExecuteScript [1] processor
to prototype what you want the processor to do. That processor will take an (inline or read
from file) Groovy script and execute it within the processor lifecycle. Matt Burgess has written
some excellent blog posts on getting started with it [2][3].

Once you have that behaving the way you like (and feel free to continue to ask questions here),
another developer would probably be able to help you convert it to a "real" custom processor.

[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.ExecuteScript/index.html
[2] https://funnifi.blogspot.com/2016/02/executescript-processor-hello-world.html
[3] https://funnifi.blogspot.com/2016/02/writing-reusable-scripted-processors-in.html


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Oct 7, 2016, at 7:20 AM, João Henrique Freitas <joaohf@gmail.com> wrote:

Hi.
Maybe a linkedin/databus client processor could be created to handle ETL.

On 06/10/2016 10:39, "Carlos Manuel Fernandes (DSI)" <carlos.antonio.fernandes@cgd.pt>
wrote:
Hi Uwe,

I see you have developed an approach similar to mine. Joe Witt launched a challenge to build a
processor based on the JSON structure I proposed.

I think we can use the code of the ConvertJSONToSQL processor as a template for this new processor.
This new processor will belong to the JSONtoSQL category (ConvertJSONToSQL is the
first one).

We can work together to reach this goal, but first we must agree on the JSON structure for
the input.

What do you think? You can contact me directly.

Thanks

Carlos

From: Uwe Geercken [mailto:uwe.geercken@web.de]
Sent: Tuesday, October 4, 2016 14:42
To: users@nifi.apache.org
Subject: Aw: Re: ELT on Nifi

Carlos,

I think that is a good point.

But I would like to bring up a little different view to it:

I have developed an open source business rule engine written in Java, and it is now in
production at at least two larger companies. Both use the Pentaho ETL tool together
with the rule engine. You can use the rules to filter/evaluate conditions, and there are also
actions which execute or transform data. The advantage is that within Pentaho it is just
a plugin, and the business logic (or, if you will, also the IT logic) is managed externally (through
a web interface, and possibly by users or superusers themselves rather than by IT). This keeps a
proper separation of responsibilities between business logic and IT logic, and the ETL process itself
is much, much cleaner.

Likewise, one could think of creating a plugin for NiFi which takes a similar approach: you
have a processor that calls the rule engine in the background. It runs and delivers the results
back to the process. Instead of having complex connections between transformation processors,
which clutter the NiFi desktop, there would be one processor for the rule engine (or, of course,
several).
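
As a rough illustration of this idea, here is a minimal Python sketch in which the rules live in data outside the flow and a single function applies them to each record. The rule format, the `OPS` table and the `apply_rules` function are hypothetical and are not the API or file format of the rule engine mentioned above; they only show the separation between externally managed rules and the code that runs them.

```python
import operator

# Hypothetical rule format (an assumption for this sketch, not the real rule engine's format).
OPS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}

# In practice these rules would be maintained outside the flow, e.g. by business users.
RULES = [
    {"field": "amount", "op": "gt", "value": 1000, "set": {"review": "yes"}},
    {"field": "country", "op": "eq", "value": "PT", "set": {"vat": 0.23}},
]


def apply_rules(record, rules):
    """Return a copy of `record` with the actions of every matching rule applied."""
    result = dict(record)
    for rule in rules:
        compare = OPS[rule["op"]]
        if compare(record[rule["field"]], rule["value"]):
            result.update(rule["set"])
    return result


invoice = {"id": 7, "amount": 1500, "country": "PT"}
processed = apply_rules(invoice, RULES)
print(processed)
```

The flow itself would then contain only one step that calls something like `apply_rules`, while changes to the business logic are made by editing the rules.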

In one of my later projects I have implemented the complete invoicing process for the company
I work for using the ruleengine. The ETL is very clean and contains only IT logic (formatting
of fields, splitting of fields, renaming, etc) and the rest is in external rule projects which
contain the business logic.

My thinking is that the division of responsibilities for the logic and a clean ETL, or in the
NiFi case a clean flow diagram, is a very strong argument for this approach.

Of course there is nothing to say against a mixed approach - custom processors and ruleengine
- I just wanted to explain my point a little bit. Everything is available on github.com/uwegeercken.

I could write the Nifi code for the processor I guess, but I will need some help with testing,
documentation and also packaging the nar file (I am not used to Maven and have struggled in
the past to create a proper nar archive).

Greetings,

Uwe

Sent: Tuesday, October 4, 2016 at 04:48
From: "Matt Burgess" <mattyb149@apache.org>
To: users@nifi.apache.org
Subject: Re: ELT on Nifi
Carlos,

The extensible nature of NiFi, whether the overall architecture was intended for ETL/ELT and/or
RDBMS/DW concepts or not, means that many of these kinds of operations are welcome (but possibly
not yet present) in NiFi. Some might warrant framework changes, but for a good portion, many
RDBMS/DW processors are possible but just haven't been added/contributed yet. In my experience,
ETL/ELT tools have focused mainly on this kind of "processor" and in contrast can't handle
the level of throughput, data formats, provenance/lineage, security, and/or data integrity
that NiFi can. In exchange, NiFi doesn't have as many of the RDBMS/DW-specific processors
available at this time. I see a few categories (please feel free to add/change/delete/discuss),
mostly having to do with tabular (row-oriented, character-delimited) data:

1) Row-level operations. This includes projections (select fields from row), alter fields
(change timestamp of column 'last_updated', e.g.), add column(s), replace-with-lookup, etc.
2) Table-level operations. This includes joins, grouping/aggregates, transposition, etc.
3) Composition/Application of the other two. This includes normalization & denormalization
(star/snowflake schemas, e.g.), dimension updates (Kimball's SCD Type 2, e.g.), etc.
4) Bulk Loading. These usually involve custom code (although in many cases for NiFi you can
deploy a command-line tool for bulk loading to a DB and use ExecuteProcess or ExecuteStreamCommand
to make it happen). These are usually native processes for getting lots of data into the DB
using an end-run around their own interfaces, possibly bypassing mechanisms that NiFi embraces,
such as provenance. But they are often faster than their SQL interface counterparts for large
data ingest.
5) Transactions. This involves executing a number of SQL statements as an atomic group (i.e.
BEGIN, a bunch of INSERTs, COMMIT). Not all DBs support this (and many have their own dialects
for such things).
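
To make category 1 concrete, the following is a minimal Python sketch of row-level operations on character-delimited data: a projection, a replace-with-lookup and an added column. The function `transform_rows`, the `COUNTRY_NAMES` lookup table and the sample data are assumptions made up for this sketch; they do not correspond to any existing NiFi processor.

```python
import csv
import io

# Hypothetical lookup table for the replace-with-lookup step.
COUNTRY_NAMES = {"PT": "Portugal", "DE": "Germany"}


def transform_rows(text, keep, lookup_field, lookup, new_field, new_value):
    """Project rows onto `keep`, replace `lookup_field` via `lookup`, add one column.

    `lookup_field` must be one of the names in `keep`.
    """
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(keep) + [new_field], lineterminator="\n")
    writer.writeheader()
    for row in reader:
        projected = {name: row[name] for name in keep}   # projection
        code = projected[lookup_field]
        projected[lookup_field] = lookup.get(code, code)  # unknown codes pass through
        projected[new_field] = new_value                  # add column
        writer.writerow(projected)
    return out.getvalue()


sample = "id,country,note\n1,PT,a\n2,DE,b\n3,XX,c\n"
result = transform_rows(sample, ["id", "country"], "country", COUNTRY_NAMES, "source", "demo")
print(result)
```

Table-level operations such as joins and aggregates differ in that they need to see more than one row at a time, which is why they are listed as a separate category.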

That's a lot of feature surface to cover! Luckily we have an ever-growing community filled
with folks representing a whole spectrum of experience and a shared passion for data :)  I
am very interested in your thoughts on where NiFi could improve on these (or other) fronts
with respect to ETL/ELT, I think we can get some good discussions (and code contributions!)
going on this. Alternatively, if you'd like to pursue a discussion on how to offload data
transformations, I'm sure the community has thoughts on that as well.

Regards,
Matt

P.S. I didn't include push-down optimization on the list because of its complexity and in
NiFi terms involves things like dynamic flow-rewrites and other magic that IMHO is against
the design principles of NiFi itself (simplicity, accountability, e.g.).

On Mon, Oct 3, 2016 at 2:25 PM, Carlos Manuel Fernandes (DSI) <carlos.antonio.fernandes@cgd.pt>
wrote:
Hi all,

When I saw NiFi for the first time, I tried to build a classical ETL/ELT flow, and this question
recurs among new users.

NiFi has very good processors for Extract and Load; the problem arises with Transform, because
ETL/ELT tools have specific “processors” (e.g. map, SCD) bound to DW
concepts and sometimes bound to a specific database (e.g. SCDNetezza). The transform
processors in NiFi are general purpose and not tied to these concepts. The immediate
solution is to create a lot of custom script processors, but then the ELT metadata (SQL) becomes
attributes or code of the processors, which is not an ideal solution.

But if we put the Transform logic outside of NiFi, for example in a JSON structure,
then it is relatively easy to construct an ELT NiFi template capable of running generic ELT flows.

Example of an ELT JSON structure (the “steps” inside the “flow” are to be executed
by PutSQL in the same transaction):
{
       "Transformer": [{
             "name": "foo1",
             "type": "Map",
             "description": "Summarize the table foo from table bar",
             "flow": [{
                    "step": 1,
                    "description": "delete all data",
                    "stmt": "delete from  foo"
             }, {
                    "step": 2,
                    "description": "Sum c2 by c1",
                    "stmt": "insert into foo(c1, c2) select c1,sum(c2) from bar group by c1"
             }]
       }, {
             "name": "foo2",
             "type": "SCD- Slowly change Dimensions type 1",
             "description": "Update a prod table based on stage table",
             "flow": [{
                    "step": 1,
                    "description": "Process type 1",
                    "stmt": "Update Prod Set Prod.columns = Stage.Columns From Stage Inner Join Prod on Stage.key = Prod.key Where Stage.IsType1 = 1"
             }]
       }]
}
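
To illustrate how such a structure could be interpreted, here is a minimal Python sketch that runs the steps of one transformer, ordered by "step", inside a single transaction and rolls everything back if any statement fails. It uses an in-memory SQLite database in place of a real DW, and the `run_transformer` function and the sample rows are assumptions made up for this sketch. It is not how PutSQL is implemented, and only the first transformer ("foo1") is included because its statements are plain SQL.

```python
import json
import sqlite3

SPEC = json.loads("""
{
  "Transformer": [{
    "name": "foo1",
    "type": "Map",
    "flow": [
      {"step": 2, "stmt": "insert into foo(c1, c2) select c1,sum(c2) from bar group by c1"},
      {"step": 1, "stmt": "delete from foo"}
    ]
  }]
}
""")


def run_transformer(conn, transformer):
    """Run all steps of one transformer, ordered by "step", as one atomic group."""
    steps = sorted(transformer["flow"], key=lambda s: s["step"])
    conn.execute("BEGIN")
    try:
        for step in steps:
            conn.execute(step["stmt"])
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # undo every step of this transformer
        raise


# isolation_level=None: transactions are controlled only by the explicit BEGIN/COMMIT above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("create table bar (c1 text, c2 integer)")
conn.execute("create table foo (c1 text, c2 integer)")
conn.executemany("insert into bar values (?, ?)", [("a", 1), ("a", 2), ("b", 5)])
conn.execute("insert into foo values ('stale', 0)")

for transformer in SPEC["Transformer"]:
    run_transformer(conn, transformer)

rows = conn.execute("select c1, c2 from foo order by c1").fetchall()
print(rows)
```

Sorting by "step" means the order of the entries in the JSON array does not matter, and the rollback means a failed insert does not leave the target table emptied by the preceding delete.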

Example of a NiFi template that executes that JSON structure:

<image001.png>


Does this make sense? Please give me feedback.

Carlos



