nifi-dev mailing list archives

From Emanuel Oliveira <emanu...@gmail.com>
Subject Re: basic enhancements + incomplete processors + when nifi gets more complicated than it should..
Date Mon, 03 Feb 2020 21:14:36 GMT
Thanks Pierre, seems I already had an account:
username: emanueol



Best Regards,
*Emanuel Oliveira*



On Mon, Feb 3, 2020 at 7:34 PM Pierre Villard <pierre.villard.fr@gmail.com>
wrote:

> Hi Emanuel,
>
> Just wanted to answer your questions regarding JIRA. You can create an
> account on the Apache JIRA [1] and open JIRAs on the NiFi project [2]. Once
> you have created an account and logged in to JIRA once, you can share your
> username with us and we can grant you the "contributor" role, which gives you
> the right to assign a JIRA to yourself if you want. But there are no specific
> requirements to create JIRAs and/or comment on existing JIRAs.
>
> [1] https://issues.apache.org/jira/secure/Signup!default.jspa
> [2] https://issues.apache.org/jira/projects/NIFI
>
>
> On Mon, Feb 3, 2020 at 13:59, Emanuel Oliveira <emanueol@gmail.com>
> wrote:
>
> > Hi Mike,
> >
> > Let me summarize, as I see my long post is not getting across the clean,
> > simple message I intended:
> > *processor InferAvroSchema*:
> > - should derive types by analysing the data in a CSV. The property "Input
> > Content Type" lists CSV and JSON, but in reality the property "Number Of
> > Records To Analyze" only works with JSON. With CSV all types come out as
> > strings. It is not hard to detect whether a field contains only digits or
> > alphanumerics; only timestamps might need an extra property to help with
> > the format (or, out of the box, just detect timestamps as well.. not hard).
> >
> > *Mandatory subset of fields verification:*
> > ValidateRecord allows 3 optional schema properties (outside the reader and
> > writer) to supply an Avro schema to validate a mandatory subset of fields
> > (a sketch follows below) - but ConvertRecord doesn't allow this.
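> >
> > As an illustration only (the field names here are hypothetical), the Avro
> > schema supplied to those extra schema properties could list just the
> > mandatory fields, for example:
> >
> > {
> >   "type": "record",
> >   "name": "mandatory_check",
> >   "fields": [
> >     { "name": "customer_id", "type": "string" },
> >     { "name": "event_date",  "type": "string" }
> >   ]
> > }
> >
> > The idea being that ValidateRecord flags records missing those two fields,
> > while the reader schema still describes the full CSV.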
> >
> >
> > Finally, I would like to request your suggestion for the following use
> > case (the same one we struggled with):
> > - given 1 CSV with a header line listing 100 fields, we want to:
> > --- validate mandatory fields (just 1 or 2 fields).
> > --- automatically create an Avro schema based on the data lines.
> > --- export Avro like this:
> > ------ some fields obfuscated + remaining fields not obfuscated (or the
> > other way around: some fields not obfuscated + remaining fields
> > obfuscated), with the header line of course staying in line with the final
> > field order (see the sketch below).
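> >
> > As a rough sketch (hypothetical field names, not the real 100-column
> > schema), the writer schema for the export step would keep only the fields
> > to be emitted, with the masking itself done in a separate step (e.g.
> > something like UpdateRecord):
> >
> > {
> >   "type": "record",
> >   "name": "export_subset",
> >   "fields": [
> >     { "name": "customer_id", "type": "string" },
> >     { "name": "email", "type": "string", "doc": "value obfuscated upstream" },
> >     { "name": "event_date", "type": "string" }
> >   ]
> > }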
> >
> > This may look like a simple use case, but it is very hard to implement.
> > Please do surprise me with the sequence of processors needed to implement
> > what I think is a great real-world example of data quality (mandatory
> > fields + partial obfuscation + export as a different format with just a
> > subset of the fields, some obfuscated and others not).
> >
> > Thanks, and I hope this is clearer now; I'm sure this will help more dev teams.
> >
> > Cheers,
> > Emanuel
> >
> >
> >
> >
> > On Mon 3 Feb 2020, 13:50 Mike Thomsen, <mikerthomsen@gmail.com> wrote:
> >
> > > One thing I should mention is that schema inference is simply not capable
> > > of exploiting Avro's field aliasing. That's an incredibly powerful feature
> > > that allows you to reconcile data sets without writing a single line of
> > > code. For example, I wrote a schema last year that uses aliases to
> > > reconcile 9 different CSV data sets into a common model without writing
> > > one line of code. This is all it takes:
> > >
> > > {
> > >   "name": "first_name",
> > >   "type": "string",
> > >   "aliases": [ "FirstName", "First Name", "FIRST_NAME", "fname", "fName" ]
> > > }
> > >
> > > That one manual line just reconciled 5 fields into a common model.
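> > >
> > > For context, that field just sits inside an ordinary record schema; a
> > > minimal, hypothetical version of the full schema would look something
> > > like this:
> > >
> > > {
> > >   "type": "record",
> > >   "name": "person",
> > >   "fields": [
> > >     {
> > >       "name": "first_name",
> > >       "type": "string",
> > >       "aliases": [ "FirstName", "First Name", "FIRST_NAME", "fname", "fName" ]
> > >     },
> > >     {
> > >       "name": "last_name",
> > >       "type": "string",
> > >       "aliases": [ "LastName", "Last Name", "LAST_NAME", "lname" ]
> > >     }
> > >   ]
> > > }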
> > >
> > > On Sun, Feb 2, 2020 at 9:32 PM Mike Thomsen <mikerthomsen@gmail.com>
> > > wrote:
> > >
> > > > Hi Emanuel,
> > > >
> > > > I think you raise some potentially valid issues that are worth looking
> > > > at in more detail. I can say our experience with NiFi is the exact
> > > > opposite, but part of that is that we are a 100% "schema first" shop.
> > > > Avro is insanely easy to learn, and we've gotten junior data engineers
> > > > up to speed in a matter of days producing beta-quality data contracts
> > > > that way.
> > > >
> > > > On Sat, Feb 1, 2020 at 12:33 PM Emanuel Oliveira <emanueol@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> Based on recent experience, I found it very hard to implement logic
> > > >> which I think should exist out of the box; instead it was a slow
> > > >> process of repeatedly discovering that a property on a processor only
> > > >> works for one type of data even when the processor supports multiple
> > > >> types, etc.
> > > >>
> > > >> I would like you all to take a keep-it-simple attitude and imagine how
> > > >> you would implement a basic scenario such as:
> > > >>
> > > >> *basic scenario 1 - shall be easy to implement out of the box,
> > > >> following 3 needs:*
> > > >> CSV (*get schema automatically via header line*) --> *validate
> > > >> mandatory subset of fields (presence) and (data types)* --> *export
> > > >> subset of fields* or all (but with some of them obfuscated)
> > > >> Problems/workarounds found in 1.9 RC3:
> > > >>
> > > >> *1. processor ValidateRecord*
> > > >> [1.1] *OK* - allows *getting the schema automatically via the header
> > > >> line* and a *mandatory subset of fields* (presence) via the 3 schema
> > > >> properties --> suggest renaming those properties to make clear that
> > > >> the ones at processor level are the "mandatory check" vs the schema
> > > >> on the reader, which is the data-read schema.
> > > >> [1.2] *NOK* - does not allow *type validation*. *One could think of
> > > >> using InferAvroSchema, right? The problem is it only supports JSON.*
> > > >> [1.3] *NOK* - ignores the writer schema, where one could supply a
> > > >> *subset of the original fields* (it always exports all original
> > > >> fields) --> add a property to control exporting all fields (default)
> > > >> or using the writer schema (with the subset).
> > > >>
> > > >> *2. processor ConvertRecord*
> > > >> [2.1] *OK* - CSVReader is able to *get the schema from the header* -->
> > > >> maybe improve/add a property to clean up field names (regex
> > > >> search/replace, so we can strip whitespace and anything else that
> > > >> breaks NiFi processors and/or that doesn't interest us).
> > > >> [2.2] *NOK* - missing *mandatory subset of fields* validation.
> > > >> [2.3] *OK* - it does a good job converting between formats, and/or
> > > >> *exporting all or a subset of fields via the writer schema*.
> > > >>
> > > >> *3. processor InferAvroSchema*
> > > >> [3.1] *NOK* - although the property "Input Content Type" lists CSV and
> > > >> JSON as inbound data, in reality the property "Number Of Records To
> > > >> Analyze" only supports JSON. It took us 2 days of debugging to
> > > >> understand the problem: 1 CSV with 4k lines, mostly nulls, "1"s or
> > > >> "2"s, but a few records with "true" or "false".. meaning the Avro
> > > >> data type should have been [null, string], but no.. as we found out,
> > > >> the type kept being [null, long], with the inference always using the
> > > >> 1st data line in the CSV to determine the field type. This was VERY
> > > >> scary to find out.. how can it be that this was considered fully
> > > >> working as expected? We ended up needing to add one more processor to
> > > >> convert the CSV into JSON so we could get a proper schema.. and even
> > > >> now we are still testing, as it seems all fields got [string] when
> > > >> some columns should be long (see the sketch below for the type we
> > > >> expected).
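> > > >>
> > > >> Just to make the expectation concrete (the field name here is
> > > >> hypothetical), the inferred field for that column of nulls, "1"s,
> > > >> "2"s and "true"/"false" values should have come out as something like:
> > > >>
> > > >> {
> > > >>   "name": "flag_column",
> > > >>   "type": [ "null", "string" ],
> > > >>   "default": null
> > > >> }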
> > > >>
> > > >> I'm not sure of the best way to expose this, but I'm working at
> > > >> enterprise level and, believe me, these small but critical nuances
> > > >> are starting to sour the mood around NiFi.
> > > >> I fell in love with NiFi and I like the idea of graphical design of
> > > >> flows etc., but we really must fix these critical little devils..
> > > >> they are being called out as NiFi problems at management level.
> > > >> I know NiFi is open source and it's up to us developers to improve
> > > >> it; I just would like to call attention to making sure that, in the
> > > >> middle of PRs and JIRA enhancements, we are not forgetting the basic
> > > >> threshold.. it doesn't make sense to release a processor with only
> > > >> 50% of its main goal developed when the remaining work would be easy
> > > >> and fast to do (e.g. InferAvroSchema).
> > > >>
> > > >> As I keep experimenting more and more with NiFi, I keep finding that
> > > >> the level of basic quality of features is below what I think it
> > > >> should be. Better not to release incomplete processors, at least
> > > >> regarding the core function of the processor.
> > > >>
> > > >> I know developers can contribute new code, fixes and enhancements..
> > > >> but is there any gatekeeper team double-checking the deliverables?
> > > >> At the very least a developer should provide enough unit tests..
> > > >> again, InferAvroSchema being a processor that exports an Avro schema
> > > >> based on either CSV or JSON, there should obviously be a couple of
> > > >> unit-test CSVs and JSONs with different data so we can be sure we
> > > >> have the proper types in the exported Avro schema, right?
> > > >>
> > > >> Above I share some ideas, and I have many more from my day-by-day
> > > >> experience, having worked with NiFi at enterprise level for more
> > > >> than 1 year now.
> > > >> Let me know what the way to create JIRAs should be, to fix several
> > > >> processors in order to allow an inexperienced NiFi client developer
> > > >> to accomplish the basic flow of:
> > > >>
> > > >> CSV (*get schema automatically via header line*) --> *validate
> > > >> mandatory subset of fields (presence) and (data types)* --> *export
> > > >> subset of fields* or all (but with some of them obfuscated)
> > > >>
> > > >> I challenge anyone to come up with flows to implement this basic
> > > >> flow.. test it and see what I mean, and you will see how incomplete
> > > >> and hard things are.. which should not be the case at all. NiFi
> > > >> should be true Lego: add processors that say they do XPTO and trust
> > > >> that they will.. but we keep finding a lot of nuances..
> > > >>
> > > >> I don't mind taking 1 day off my work to have a meeting with some of
> > > >> you - I don't know if there's such a thing as a tech lead on the
> > > >> NiFi project? - and I think it would be urgent to fix the
> > > >> foundations of some processors. Let me know..
> > > >>
> > > >>
> > > >>
> > > >> Best Regards,
> > > >> *Emanuel Oliveira*
> > > >>
> > > >
> > >
> > >
> >
>
