nifi-dev mailing list archives

From Joe Gresock <jgres...@gmail.com>
Subject Re: NiFi 1.2.0 Record processors question
Date Fri, 19 May 2017 19:14:20 GMT
Yes, both of your examples help explain the use of the CSV header parsing.

I think I have a much better understanding of the new framework now, thanks
to Bryan and Matt.  Mission accomplished!

On Fri, May 19, 2017 at 7:04 PM, Bryan Bende <bbende@gmail.com> wrote:

> When a reader produces a record it attaches the schema it used to the
> record, but we currently don't have a way for a writer to use that
> schema when writing a record, although I think we do want to support
> that... something like a "Use Schema in Record" option as a choice in
> the 'Schema Access Strategy' of writers.
>
> For now, when a processor uses a reader and a writer, and you also
> want to read and write with the same schema, then you would still have
> to define the same schema for the writer to use even if you had a CSV
> reader that inferred the schema from the headers.
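For a concrete (purely illustrative) example: if the CSV header were `id,name,price`, the Schema Text you would have to define for the writer is the matching all-string Avro record schema, e.g.:

```json
{
  "type": "record",
  "name": "example",
  "fields": [
    {"name": "id",    "type": "string"},
    {"name": "name",  "type": "string"},
    {"name": "price", "type": "string"}
  ]
}
```

(The record name here is made up; only the field names and types have to line up with what the reader inferred.)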
>
> There are some processors that only use a reader, like
> PutDatabaseRecord, where using the CSV header would still be helpful.
>
> There are also a lot of cases where you would write with a different
> schema than you read with, so using the CSV header for reading is
> still helpful in those cases too.
>
> Hopefully I am making sense and not confusing things more.
>
>
> On Fri, May 19, 2017 at 1:27 PM, Joe Gresock <jgresock@gmail.com> wrote:
> > Matt,
> >
> > Great response, this does help explain a lot.  Reading through your post
> > made me realize I didn't understand the AvroSchemaRegistry.  I had been
> > thinking it was something that NiFi processors populated, but I re-read
> > its usage description and it does indeed say to use dynamic properties
> > for the schema name / value.  In that case, I can definitely see how
> > this is not dynamic in the sense of inferring any schemas on the flow.
> > It makes me wonder if there would be a way to populate the schema
> > registry from flow files.  When I first glanced at the processors, I had
> > assumed that when the schema was inferred from the CSV headers, it would
> > create an entry in the AvroSchemaRegistry, provided you filled in the
> > correct properties.  Clearly this is not how it works.
> >
> > Just so I understand, does your first paragraph mean that even if you
> > use the CSV headers to determine the schema, you still can't use the
> > *Record processors unless you manually register a matching schema in
> > the schema registry, or otherwise somehow set the schema in an
> > attribute?  In this case, it almost seems like inferring the schema
> > from the CSV headers serves no purpose, and I don't see how NIFI-3921
> > would alleviate that (it only appears to address avro flow files with
> > embedded schema).
> >
> > Based on this understanding, I was able to successfully get the
> > following flow working: InferAvroSchema -> QueryRecord.
> >
> > QueryRecord uses CSVReader with "Use Schema Text Property" and Schema
> > Text set to ${inferred.avro.schema} (which is populated by the
> > InferAvroSchema processor).  It also uses JsonRecordSetWriter with a
> > similar configuration.  I could attach a template, but I don't know
> > the best way to do that on the listserv.
> >
> > Joe
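Spelled out, the controller-service configuration Joe describes looks roughly like this (labels paraphrased from the thread, not copied verbatim from the NiFi UI):

```
CSVReader (Record Reader):
    Schema Access Strategy : Use 'Schema Text' Property
    Schema Text            : ${inferred.avro.schema}

JsonRecordSetWriter (Record Writer):
    Schema Access Strategy : Use 'Schema Text' Property
    Schema Text            : ${inferred.avro.schema}
```

That is, both services resolve their schema from the attribute that the upstream InferAvroSchema processor wrote onto the flow file.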
> >
> > On Fri, May 19, 2017 at 4:59 PM, Matt Burgess <mattyb149@apache.org> wrote:
> >
> >> Joe,
> >>
> >> Using the CSV Headers to determine the schema is currently the only
> >> "dynamic" schema strategy, so it will be tricky to use with the other
> >> Readers/Writers and associated processors (which require an explicit
> >> schema). This should be alleviated with NIFI-3921 [1].  For this first
> >> release, I believe the approach would be to identify the various
> >> schemas for your incoming/outgoing data, create a Schema Registry with
> >> all of them, then configure the various Record Readers/Writers using those.
> >>
> >> There were some issues during development related to using the
> >> incoming vs. outgoing schema for various record operations; if
> >> QueryRecord seems to be using the output schema for querying, then it
> >> is likely a bug. However, in this case it might just be that you need
> >> an explicit schema for your Writer that matches the input schema
> >> (which is inferred from the CSV header). The CSV Header inference
> >> currently assumes all fields are Strings, so a nominal schema would
> >> have the same number of fields as columns, each with type String. If
> >> you don't know the number of columns and/or the column names are
> >> dynamic per CSV file, I believe we have a gap here (for now).
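As an illustration (this is not NiFi's actual implementation, and the function name is made up), the all-Strings header inference Matt describes amounts to something like this Python sketch:

```python
import csv
import io
import json

def infer_string_schema(csv_text, schema_name="inferred"):
    """Build an Avro record schema from a CSV header line, treating every
    column as a string -- mirroring the all-Strings assumption above."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "type": "record",
        "name": schema_name,
        "fields": [{"name": col, "type": "string"} for col in header],
    }

# One field per column, all typed as string:
schema = infer_string_schema("id,name,price\n1,apple,0.5\n")
print(json.dumps(schema, indent=2))
```

The resulting JSON is exactly the kind of "nominal schema" described above: same number of fields as columns, each with type String.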
> >>
> >> I thought of maybe having an InferRecordSchema processor that would
> >> fill in the avro.text attribute for use in various downstream record
> >> readers/writers, but inferring schemas in general is not a trivial
> >> task. An easier interim solution might be to have an
> >> AddSchemaAsAttribute processor, which takes a Reader to parse the
> >> records and determine the schema (whether dynamic or static), and set
> >> the avro.text attribute on the original incoming flow file, then
> >> transfer the original flow file. This would require two reads, one by
> >> AddSchemaAsAttribute and one by the downstream record processor, but
> >> it should be fairly easy to implement.  Then again, since new features
> >> would go into 1.3.0, hopefully NIFI-3921 will be implemented by then,
> >> rendering all this moot :)
> >>
> >> Regards,
> >> Matt
> >>
> >> [1] https://issues.apache.org/jira/browse/NIFI-3921
> >>
> >> On Fri, May 19, 2017 at 12:25 PM, Joe Gresock <jgresock@gmail.com> wrote:
> >> > I've tried a couple different configurations of CSVReader /
> >> > JsonRecordSetWriter with the QueryRecord processor, and I don't
> >> > think I quite have the usage down yet.
> >> >
> >> > Can someone give a basic example configuration in the following 2
> >> > scenarios?  I followed most of Matt Burgess's response to the post
> >> > titled "How to use ConvertRecord Processor", but I don't think it
> >> > tells the whole story.
> >> >
> >> > 1) QueryRecord, converting CSV to JSON, using only the CSV headers
> >> > to determine the schema.  (I tried selecting Use String Fields From
> >> > Header in CSVReader, but the processor really seems to want to use
> >> > the JsonRecordSetWriter to determine the schema)
> >> >
> >> > 2) QueryRecord, converting CSV to JSON, using a cached Avro schema.
> >> > I imagine I need to use InferAvroSchema here, but I'm not sure how
> >> > to cache it in the AvroSchemaRegistry.  Also not quite sure how to
> >> > configure the properties of each controller service in this case.
> >> >
> >> > Any help would be appreciated.
> >> >
> >> > Joe
> >> >
> >> > --
> >> > I know what it is to be in need, and I know what it is to have
> plenty.  I
> >> > have learned the secret of being content in any and every situation,
> >> > whether well fed or hungry, whether living in plenty or in want.  I
> can
> >> do
> >> > all this through him who gives me strength.    *-Philippians 4:12-13*
> >>
> >
> >
> >
>



