avro-user mailing list archives

From Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>
Subject Re: Need help transforming Avro schemas
Date Thu, 21 Aug 2014 09:38:38 GMT
Hi Michael,

Thanks a lot for your suggestions. Now I understand your idea of using your
schema checking method as a starting point for defining a method that
modifies a schema by traversing it. I will definitely take a look at that
approach. I will also try the Avro Schema IDL.
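
A minimal sketch of that traversal idea follows: copy the tree while
descending and apply the change along the way. It deliberately uses a toy
immutable tree instead of Avro's Schema class, so every name in it
(SchemaCopySketch, Type, Field, addField) is hypothetical and for
illustration only. With Avro's Java API the same shape would presumably
build each copied record with Schema.createRecord(name, doc, namespace,
isError) followed by setFields(...), which keeps every record named.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model for illustration only; these are NOT Avro's Schema classes.
final class SchemaCopySketch {

    /** Immutable toy type: either a primitive such as "int" or a named record. */
    static final class Type {
        final String name;
        final boolean isRecord;
        final List<Field> fields; // always empty for primitives

        private Type(String name, boolean isRecord, List<Field> fields) {
            this.name = name;
            this.isRecord = isRecord;
            this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
        }

        static Type primitive(String name) {
            return new Type(name, false, Collections.<Field>emptyList());
        }

        static Type record(String name, Field... fields) {
            return new Type(name, true, Arrays.asList(fields));
        }
    }

    /** Immutable field: a field name plus the type of its value. */
    static final class Field {
        final String name;
        final Type type;

        Field(String name, Type type) {
            this.name = name;
            this.type = type;
        }
    }

    /**
     * Returns a copy of root in which every record named targetRecord
     * gains extra as its last field. The input tree is never modified.
     */
    static Type addField(Type root, String targetRecord, Field extra) {
        if (!root.isRecord) {
            return root; // primitives are immutable, so they can be shared
        }
        List<Field> copied = new ArrayList<>();
        for (Field f : root.fields) {
            // Descend first so that nested records are rewritten too.
            copied.add(new Field(f.name, addField(f.type, targetRecord, extra)));
        }
        if (root.name.equals(targetRecord)) {
            copied.add(extra);
        }
        return new Type(root.name, true, copied);
    }
}
```

Because every occurrence of the target record receives the same extra
field, two records that share a name still share a definition after the
copy, which is the constraint discussed further down in the thread.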

Thanks again for your help!

Greetings,

Juan


2014-08-20 20:52 GMT+02:00 Michael Pigott <mpigott.subscriptions@gmail.com>:

> Hi Juan!
>
> I originally considered showing you the AvroSchemaGenerator, but I thought
> it was a bit complex and very specific to XML Schema itself.  I think you
> would have better luck understanding how either Protobuf or Thrift schemas
> are converted to Avro instead, as those are more generic, and the feature
> set more closely maps to Avro.
>
> To answer your question, I never was able to find a use case where
> creating an Avro schema from only a list of fields worked for me.  That was
> okay in my case, because I could just use the corresponding XML element
> name and namespace when creating the record.  You might have better luck,
> depending on your use case?
>
> I unfortunately do not know of an existing tool that solves your problem,
> and I poked around the existing code and JIRA tickets for a bit and came up
> empty.  I originally thought you could write a clone function yourself, and
> create a new schema as you recursively descend through the old one, adding
> in any changes you wanted to make along the way.  (The comparison tool I
> showed you would make a good template.)
>
> That said, you might have better luck using the Avro Schema IDL[1], rather
> than rolling your own?
>
> Good luck!
> Mike
>
> [1] http://avro.apache.org/docs/1.7.7/idl.html
>
>
> On Wed, Aug 20, 2014 at 3:19 AM, Juan Rodríguez Hortalá <
> juan.rodriguez.hortala@gmail.com> wrote:
>
>> Hi Michael,
>>
>> Thanks a lot for your suggestion. I found this class particularly
>> interesting:
>> https://github.com/mikepigott/xml-to-avro/blob/master/avro-to-xml/src/main/java/org/apache/avro/xml/AvroSchemaGenerator.java
>> As I understand it, it generates an Avro schema by visiting an XML
>> document. I assume that you used a fresh name for the record at each node;
>> otherwise you might have encountered problems like the following. Starting
>> from a Schema object 'personSchema' containing the following schema:
>>
>> {
>>   "type" : "record",
>>   "name" : "Person",
>>   "namespace" : "test",
>>   "doc" : "Schema for test.SchemasTest$Person",
>>   "fields" : [ {
>>     "name" : "age",
>>     "type" : "int"
>>   }, {
>>     "name" : "name",
>>     "type" : [ "null", "string" ]
>>   } ]
>> }
>>
>> The following code works:
>>
>> Schema twoPersons = Schema.createRecord(
>>     Arrays.asList(
>>         new Schema.Field(personSchema.getName() + "_1", personSchema,
>>             personSchema.getDoc() + " _1", null),
>>         new Schema.Field(personSchema.getName() + "_2", personSchema,
>>             personSchema.getDoc() + " _2", null)));
>>
>> but when I use the new Schema object twoPersons it's pretty easy to
>> encounter an exception, for example:
>>
>>     System.out.println(
>>         new Schema.Parser().setValidate(true).parse(twoPersons.toString()));
>> throws
>>
>> org.apache.avro.SchemaParseException: No name in schema:
>> {"type":"record","fields":[{"name":"Person_1","type":{"type":"record","name":"Person","namespace":"test","doc":"Schema
>> for
>> test.SchemasTest$Person","fields":[{"name":"age","type":"int"},{"name":"name","type":["null","string"]}]},"doc":"Schema
>> for test.SchemasTest$Person
>> _1"},{"name":"Person_2","type":"test.Person","doc":"Schema for
>> test.SchemasTest$Person _2"}]}
>>     at org.apache.avro.Schema.getRequiredText(Schema.java:1221)
>>     at org.apache.avro.Schema.parse(Schema.java:1092)
>>     at org.apache.avro.Schema$Parser.parse(Schema.java:953)
>>     at org.apache.avro.Schema$Parser.parse(Schema.java:943)
>>     at
>> com.lambdoop.sdk.core.SchemasTest.createRecordFailTest(SchemasTest.java:232)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at
>> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>>     at
>> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>>     at
>> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>>     at
>> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>>     at
>> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>>     at
>> org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
>>     at
>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
>>     at
>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
>>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>>     at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>>     at
>> org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
>>     at
>> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>     at
>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>>     at
>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>>     at
>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>>     at
>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
>>
>>
>> Adding the name with twoPersons.addProp("name", "twoPersons") doesn't
>> work because "name" is a reserved property. SchemaBuilder cannot be used
>> either because it doesn't allow adding Schema objects to a field, but just
>> creating schemas from scratch.
>>
>> Another problem arises when I convert the schemas to Jackson's JsonNode.
>> Starting from an empty schema like
>>
>> {
>>   "type" : "record",
>>   "name" : "Person",
>>   "namespace" : "test",
>>   "fields" : [ ]
>> }
>>
>> if I add a field with schema Person by manipulating the JsonNode, when I
>> convert back to an Avro Schema object I get a "Can't redefine:
>> test.Person". My conclusions then are:
>> - every record needs to have a name
>> - two records with the same name must have the same schema
>>
>> That is not very surprising, as it corresponds to what is specified in
>> http://avro.apache.org/docs/current/spec.html. I was wondering if anyone
>> knows of a library for transforming Avro schemas that can do things like
>> adding an existing schema as a new field of another schema, and that has
>> already dealt with these details.
>>
>> Thanks a lot for your help,
>>
>> Greetings,
>>
>> Juan Rodríguez
>>
>>
>>
>>
>>
>>
>> 2014-08-19 7:04 GMT-07:00 Michael Pigott <mpigott.subscriptions@gmail.com>:
>>
>>> Hi Juan,
>>>     That sounds really complex.  Would you instead be able to build or
>>> retrieve the original Avro Schema objects, and then build a new Schema from
>>> its definition?  For my work on transforming XML to Avro and back[1], I
>>> wrote a comparison tool to confirm that two Avro Schemas are equivalent by
>>> recursively descending through both schemas[2].  Perhaps you can use
>>> something similar to build a transformed Avro schema in memory, by applying
>>> your transformations on the fly?
>>>
>>> Good luck!
>>> Mike
>>>
>>> [1] https://issues.apache.org/jira/browse/AVRO-457
>>> [2]
>>> https://github.com/mikepigott/xml-to-avro/blob/master/avro-to-xml/src/test/java/org/apache/avro/xml/UtilsForTests.java
>>>
>>>
>>> On Tue, Aug 19, 2014 at 2:23 AM, Juan Rodríguez Hortalá <
>>> juan.rodriguez.hortala@gmail.com> wrote:
>>>
>>>> Hi list,
>>>>
>>>> I'm working on a project in Java where we have a DSL that operates on
>>>> GenericRecord objects, over which we define record transformation
>>>> operations such as projections, filters and so on. This implies that the
>>>> Avro schema of the records evolves by adding and deleting record fields.
>>>> As a result, the Avro schemas used differ in each program depending on
>>>> the operations used. Hence I have to define Avro schema transformations
>>>> and generate new schemas as modifications of other schemas. For that, the
>>>> Avro schema builder classes are only useful for the starting schema, and
>>>> the same goes for a POJO-to-schema mapping like avro-jackson. The main
>>>> problem I face is that in Avro, by design, "schema objects are logically
>>>> immutable", as stated in the documentation. So far I have been converting
>>>> the schema to a string, parsing it with Jackson, manipulating its
>>>> representation as a JsonNode, and then parsing it back to Avro. In that
>>>> last step I sometimes have problems, either because Avro records are
>>>> named and anonymous records are not always legal in complete schemas, or
>>>> because the same record name cannot be used twice in two child fields of
>>>> a parent record. I was therefore thinking of using generated schema
>>>> names, with an increasing ID or a random UUID. Anyway, my questions are:
>>>> is the approach I'm describing correct? Are you aware of a library for
>>>> creating new Avro schemas by manipulating an input schema? Maybe those
>>>> capabilities are already present in Avro's Java API and I haven't
>>>> noticed.
>>>>
>>>> Any help will be welcome. Thanks a lot in advance.
>>>>
>>>> Greetings,
>>>>
>>>> Juan Rodríguez Hortalá
>>>>
>>>
>>>
>>
>
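
The first message in the thread floats the idea of generated schema names,
"with an increasing ID or a random UUID". A minimal sketch of both variants
is given below. The RecordNamer class and its methods are hypothetical
helpers, not part of Avro. The one Avro-specific detail is that the
specification restricts names to a leading [A-Za-z_] followed by
[A-Za-z0-9_], so the hyphens in a UUID's text form have to be replaced
before the result can serve as a record name.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration; not part of the Avro library.
final class RecordNamer {
    private final AtomicLong counter = new AtomicLong();

    /** Appends an increasing ID: Person_1, Person_2, ... (unique per RecordNamer). */
    String sequential(String base) {
        return base + "_" + counter.incrementAndGet();
    }

    /** Appends a random UUID, with hyphens replaced so the name stays legal. */
    static String random(String base) {
        return base + "_" + UUID.randomUUID().toString().replace('-', '_');
    }

    /** Checks the name pattern given in the Avro specification. */
    static boolean isValidName(String name) {
        return name.matches("[A-Za-z_][A-Za-z0-9_]*");
    }
}
```

The sequential variant gives readable, reproducible names but is only
unique within one RecordNamer instance. The UUID variant needs no shared
state, at the cost of names that are harder to read in a serialized schema.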
