nifi-dev mailing list archives

From: Koji Kawamura <ijokaruma...@gmail.com>
Subject: Re: NiFi 1.2.0 Record processors question
Date: Mon, 22 May 2017 01:59:14 GMT
I've updated the JIRA description to cover not only the embedded Avro
schema case but also schemas derived by readers such as CSVReader.
https://issues.apache.org/jira/browse/NIFI-3921

Thanks,
Koji

On Sat, May 20, 2017 at 4:14 AM, Joe Gresock <jgresock@gmail.com> wrote:
> Yes, both of your examples help explain the use of the CSV header parsing.
>
> I think I have a much better understanding of the new framework now, thanks
> to Bryan and Matt.  Mission accomplished!
>
> On Fri, May 19, 2017 at 7:04 PM, Bryan Bende <bbende@gmail.com> wrote:
>
>> When a reader produces a record it attaches the schema it used to the
>> record, but we currently don't have a way for a writer to use that
>> schema when writing a record, although I think we do want to support
>> that... something like a "Use Schema in Record" option as a choice in
>> the 'Schema Access Strategy' of writers.
>>
>> For now, when a processor uses both a reader and a writer, and you want
>> to read and write with the same schema, you still have to define that
>> schema explicitly for the writer, even if the CSV reader inferred it
>> from the headers.
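>>
>> For illustration only (the column names here are made up), if the CSV
>> header were "id,name,email", the writer would need a matching all-string
>> schema defined explicitly, e.g. pasted into its Schema Text property:
>>
>>   {
>>     "type": "record",
>>     "name": "csv_record",
>>     "fields": [
>>       { "name": "id",    "type": "string" },
>>       { "name": "name",  "type": "string" },
>>       { "name": "email", "type": "string" }
>>     ]
>>   }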
>>
>> There are some processors that only use a reader, like
>> PutDatabaseRecord, where using the CSV header would still be helpful.
>>
>> There are also a lot of cases where you would write with a different
>> schema than you read with, so using the CSV header for reading is
>> still helpful in those cases too.
>>
>> Hopefully I am making sense and not confusing things more.
>>
>>
>> On Fri, May 19, 2017 at 1:27 PM, Joe Gresock <jgresock@gmail.com> wrote:
>> > Matt,
>> >
>> > Great response, this does help explain a lot.  Reading through your post
>> > made me realize I didn't understand the AvroSchemaRegistry.  I had been
>> > thinking it was something that NiFi processors populated, but I re-read
>> > its usage description and it does indeed say to use dynamic properties
>> > for the schema name / value.  In that case, I can definitely see how this
>> > is not dynamic in the sense of inferring any schemas on the flow.  It
>> > makes me wonder if there would be a way to populate the schema registry
>> > from flow files.  When I first glanced at the processors, I had assumed
>> > that when the schema was inferred from the CSV headers, it would create
>> > an entry in the AvroSchemaRegistry, provided you filled in the correct
>> > properties.  Clearly this is not how it works.
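>> >
>> > To make that concrete for anyone else reading: each dynamic property on
>> > the AvroSchemaRegistry is one schema, where the property name is the
>> > schema name and the value is the schema text.  The names below are just
>> > made-up examples:
>> >
>> >   Property name:  my-csv-schema
>> >   Property value: {"type":"record","name":"my_csv","fields":[
>> >                     {"name":"col1","type":"string"},
>> >                     {"name":"col2","type":"string"}]}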
>> >
>> > Just so I understand, does your first paragraph mean that even if you use
>> > the CSV headers to determine the schema, you still can't use the *Record
>> > processors unless you manually register a matching schema in the schema
>> > registry, or otherwise somehow set the schema in an attribute?  In this
>> > case, it almost seems like inferring the schema from the CSV headers
>> > serves no purpose, and I don't see how NIFI-3921 would alleviate that
>> > (it only appears to address Avro flow files with embedded schema).
>> >
>> > Based on this understanding, I was able to successfully get the following
>> > flow working:
>> > InferAvroSchema -> QueryRecord.
>> >
>> > QueryRecord uses CSVReader with "Use Schema Text Property" and Schema Text
>> > set to ${inferred.avro.schema} (which is populated by the InferAvroSchema
>> > processor).  It also uses JsonRecordSetWriter with a similar
>> > configuration.  I could attach a template, but I don't know the best way to
>> > do that on the listserve.
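>> >
>> > In sketch form, the relevant settings were roughly as follows (the
>> > dynamic property name and SQL on QueryRecord are just an example):
>> >
>> >   CSVReader
>> >     Schema Access Strategy: Use Schema Text Property
>> >     Schema Text:            ${inferred.avro.schema}
>> >   JsonRecordSetWriter
>> >     Schema Access Strategy: Use Schema Text Property
>> >     Schema Text:            ${inferred.avro.schema}
>> >   QueryRecord
>> >     Record Reader:          CSVReader
>> >     Record Writer:          JsonRecordSetWriter
>> >     results (dynamic):      SELECT * FROM FLOWFILE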
>> >
>> > Joe
>> >
>> > On Fri, May 19, 2017 at 4:59 PM, Matt Burgess <mattyb149@apache.org> wrote:
>> >
>> >> Joe,
>> >>
>> >> Using the CSV Headers to determine the schema is currently the only
>> >> "dynamic" schema strategy, so it will be tricky to use with the other
>> >> Readers/Writers and associated processors (which require an explicit
>> >> schema). This should be alleviated with NIFI-3921 [1].  For this first
>> >> release, I believe the approach would be to identify the various
>> >> schemas for your incoming/outgoing data, create a Schema Registry with
>> >> all of them, then configure the various Record Readers/Writers to use
>> >> those.
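>> >>
>> >> Concretely, that would look something like the sketch below; the schema
>> >> name is a placeholder and the option labels are approximate:
>> >>
>> >>   AvroSchemaRegistry
>> >>     my-csv-schema = <Avro schema text for that data>
>> >>   CSVReader / JsonRecordSetWriter
>> >>     Schema Access Strategy: Use Schema Name Property
>> >>     Schema Registry:        AvroSchemaRegistry
>> >>     Schema Name:            my-csv-schema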
>> >>
>> >> There were some issues during development related to using the
>> >> incoming vs. outgoing schema for various record operations; if
>> >> QueryRecord seems to be using the output schema for querying, then it
>> >> is likely a bug. However, in this case it might just be that you need
>> >> an explicit schema for your Writer that matches the input schema
>> >> (which is inferred from the CSV header). The CSV Header inference
>> >> currently assumes all fields are Strings, so a nominal schema would
>> >> have the same number of fields as columns, each with type String. If
>> >> you don't know the number of columns and/or the column names are
>> >> dynamic per CSV file, I believe we have a gap here (for now).
>> >>
>> >> I thought of maybe having an InferRecordSchema processor that would
>> >> fill in the avro.text attribute for use in various downstream record
>> >> readers/writers, but inferring schemas in general is not a trivial
>> >> task. An easier interim solution might be to have an
>> >> AddSchemaAsAttribute processor, which takes a Reader to parse the
>> >> records and determine the schema (whether dynamic or static), sets
>> >> the avro.text attribute on the original incoming flow file, and then
>> >> transfers the original flow file. This would require two reads, one by
>> >> AddSchemaAsAttribute and one by the downstream record processor, but
>> >> it should be fairly easy to implement.  Then again, since new features
>> >> would go into 1.3.0, hopefully NIFI-3921 will be implemented by then,
>> >> rendering all this moot :)
>> >>
>> >> Regards,
>> >> Matt
>> >>
>> >> [1] https://issues.apache.org/jira/browse/NIFI-3921
>> >>
>> >> On Fri, May 19, 2017 at 12:25 PM, Joe Gresock <jgresock@gmail.com> wrote:
>> >> > I've tried a couple different configurations of CSVReader /
>> >> > JsonRecordSetWriter with the QueryRecord processor, and I don't think I
>> >> > quite have the usage down yet.
>> >> >
>> >> > Can someone give a basic example configuration in the following 2
>> >> > scenarios?  I followed most of Matt Burgess's response to the post titled
>> >> > "How to use ConvertRecord Processor", but I don't think it tells the whole
>> >> > story.
>> >> >
>> >> > 1) QueryRecord, converting CSV to JSON, using only the CSV headers to
>> >> > determine the schema.  (I tried selecting Use String Fields from Header in
>> >> > CSVReader, but the processor really seems to want to use the
>> >> > JsonRecordSetWriter to determine the schema)
>> >> >
>> >> > 2) QueryRecord, converting CSV to JSON, using a cached avro schema.  I
>> >> > imagine I need to use InferAvroSchema here, but I'm not sure how to cache
>> >> > it in the AvroSchemaRegistry.  Also not quite sure how to configure the
>> >> > properties of each controller service in this case.
>> >> >
>> >> > Any help would be appreciated.
>> >> >
>> >> > Joe
>> >> >
>> >>
>> >
>> >
>> >
>>
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.    *-Philippians 4:12-13*
