gobblin-user mailing list archives

From Zhixiong Chen <zhc...@linkedin.com>
Subject Re: SchemaParseException When Writing to ORC File
Date Tue, 21 Nov 2017 18:34:04 GMT
Hi Tamas,

I'm not quite sure what issue Prateek has. Is it that setting `avro.schema.url` to a schema registry
URL breaks `HiveSerDeConverter` because the registry returns a schema wrapper rather than the schema itself?

Based on my investigation, `avro.schema.url` is usually a path to an avsc file. I guess
naming it a URL creates confusion. It doesn't seem to be used as a schema registry URL.
Alternatively, `avro.schema.literal` can be used. Set it to the content of an avsc file
or the actual schema text.
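
As an illustration of the point above, here is a minimal Python sketch of unwrapping a registry response so that only the Avro schema remains. The response is an abbreviated copy of the one quoted in this thread; the function name `unwrap_schema` is hypothetical and not part of Gobblin or Confluent.

```python
import json

# Abbreviated copy of the registry response quoted in this thread.
REGISTRY_RESPONSE = json.dumps({
    "subject": "localhost.demo.Demo-value",
    "version": 1,
    "id": 4,
    "schema": json.dumps({
        "type": "record",
        "name": "Value",
        "namespace": "localhost.demo.Demo",
        "fields": [{"name": "id", "type": "int"}],
    }),
})


def unwrap_schema(response_text: str) -> str:
    """Return the Avro schema JSON held in the wrapper's "schema" field."""
    wrapper = json.loads(response_text)
    schema_text = wrapper["schema"]  # the schema is a JSON string nested in JSON
    json.loads(schema_text)          # raises ValueError if it is not valid JSON
    return schema_text


schema_literal = unwrap_schema(REGISTRY_RESPONSE)
print(schema_literal)
```

The printed text could then be saved as an avsc file for `avro.schema.url` to point at, or assigned to `avro.schema.literal`.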

Speaking of schema registry support, I do think it's useful. We have something similar in
the existing codebase; see `org.apache.gobblin.metrics.kafka.KafkaAvroEventReporter`.
However, the idea is not implemented for general use. Some related constructs are
placed in the "wrong" modules. We may start with new constructs that deprecate those.


From: Tamas Nemeth <tamas.nemeth@prezi.com>
Sent: Tuesday, November 21, 2017 6:02:41 AM
To: user@gobblin.incubator.apache.org
Cc: Engg_data_ingestion
Subject: Re: SchemaParseException When Writing to ORC File

Hey Zhixiong,

The issue here is that Prateek set the schema URL to the Confluent schema registry endpoint
address, which sends back a message where the actual schema is wrapped under the schema property:
{"subject":"localhost.demo.Demo-value","version":1,"id":4,"schema":"{\"type\":\"record\",\"name\":\"Value\",\"namespace\":\"localhost.demo.Demo\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"Name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"Age\",\"type\":[\"null\",{\"type\":\"int\",\"connect.type\":\"int16\"}],\"default\":null},{\"name\":\"Department\",\"type\":[\"null\",\"string\"],\"default\":null}],\"connect.name\":\"localhost.demo.Demo.Value\"}"}

Do you think it would make sense to support schema registries (or at least the Confluent schema
registry, as it is quite popular nowadays) wherever the Avro schema URL can be set? I think it
would. What do you think?
How do you use it in your environment? Do you have an endpoint in your schema registry that replies
with only the actual schema?
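
On the question of a schema-only endpoint: Confluent's REST API documentation describes a `/subjects/{subject}/versions/{version}/schema` resource that returns the unescaped schema alone. Whether a given registry version serves it, and whether `HiveSerDeConverter` accepts such a URL, is not confirmed in this thread. A minimal sketch of building that URL follows; the base URL `http://localhost:8081` and the helper name `schema_only_url` are assumptions for illustration.

```python
from urllib.parse import quote


def schema_only_url(base_url: str, subject: str, version="latest") -> str:
    """Build the URL of the registry resource that returns only the schema."""
    # Percent-encode the subject so unusual characters stay inside one path segment.
    encoded_subject = quote(subject, safe="")
    return f"{base_url.rstrip('/')}/subjects/{encoded_subject}/versions/{version}/schema"


# Hypothetical registry address; the subject and version come from the thread.
url = schema_only_url("http://localhost:8081", "localhost.demo.Demo-value", 1)
print(url)
```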


On Tue, Nov 21, 2017 at 1:28 AM Zhixiong Chen <zhchen@linkedin.com> wrote:

Hi Prateek,

Per the suggestion from Tamas, the direct response is not the schema itself, but it contains a schema
field that holds the schema as a JSON string. Is that the schema you're looking for?

Or are you actually saying that the entire response is used for writing an Avro file?


From: Prateek Gupta <prateek.gupta3@myntra.com>
Sent: Thursday, November 16, 2017 11:49:53 PM
To: user@gobblin.incubator.apache.org
Cc: Engg_data_ingestion
Subject: Re: SchemaParseException When Writing to ORC File

Hi Tamas,

Thanks for the response.

The same schema is also used for writing an Avro file.
Since the schema is registered with Schema Registry, the Avro message does not carry the schema,
only a global identifier.

PFB, the endpoint used.


Prateek Gupta

On Fri, Nov 17, 2017 at 12:53 PM, Tamas Nemeth <tamas.nemeth@prezi.com> wrote:
Hey Prateek,

I think the problem here is that what you get from the Schema Registry is not just
the Avro schema. If you check the response in your message, the actual schema is in the schema field.

Does Confluent schema registry have an endpoint where you can get back the schema only?


On Fri, Nov 17, 2017 at 7:11, Prateek Gupta <prateek.gupta3@myntra.com> wrote:

Please aid in resolving the aforementioned issue.

Prateek Gupta

On Wed, Nov 15, 2017 at 2:44 PM, Prateek Gupta <prateek.gupta3@myntra.com> wrote:

As per the documentation, Writing to an ORC File<https://gobblin.readthedocs.io/en/latest/case-studies/Writing-ORC-Data/#writing-to-an-orc-file>,
"In order to configure the HiveSerDeConverter avro.schema.url must be set when using this
deserializer so that the Hive SerDe knows what Avro Schema to use when converting the record."

If the URL is set to a Confluent Schema Registry (used for storing and retrieving Avro schemas)
address, it fails with the exception below.

org.apache.avro.SchemaParseException: No type: {"subject":"localhost.demo.Demo-value","version":1,"id":4,"schema":"{\"type\":\"record\",\"name\":\"Value\",\"namespace\":\"localhost.demo.Demo\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"Name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"Age\",\"type\":[\"null\",{\"type\":\"int\",\"connect.type\":\"int16\"}],\"default\":null},{\"name\":\"Department\",\"type\":[\"null\",\"string\"],\"default\":null}],\"connect.name\":\"localhost.demo.Demo.Value\"}"}
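
The "No type" message is consistent with the parser receiving the registry wrapper, whose top-level object has the keys `subject`, `version`, `id` and `schema` but no `type`. A minimal Python sketch of that check follows, using an abbreviated copy of the response; `has_avro_type` is a hypothetical helper that covers only object-form schemas, not Avro's own validation.

```python
import json


def has_avro_type(json_text: str) -> bool:
    """Check whether the top-level JSON object carries a "type" key."""
    node = json.loads(json_text)
    return isinstance(node, dict) and "type" in node


# Abbreviated copy of the registry response from the exception message.
wrapper_text = json.dumps({
    "subject": "localhost.demo.Demo-value",
    "version": 1,
    "id": 4,
    "schema": json.dumps({"type": "record", "name": "Value", "fields": []}),
})

print(has_avro_type(wrapper_text))                        # False: wrapper has no "type"
print(has_avro_type(json.loads(wrapper_text)["schema"]))  # True: the nested schema does
```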

Please provide assistance in resolution for the same.

Prateek Gupta
