orc-dev mailing list archives

From "Shawn Hooton (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-200) json-schema and convert commands should support schema evolution of json documents
Date Thu, 25 May 2017 05:25:04 GMT
Shawn Hooton created ORC-200:
--------------------------------

             Summary: json-schema and convert commands should support schema evolution of json documents
                 Key: ORC-200
                 URL: https://issues.apache.org/jira/browse/ORC-200
             Project: ORC
          Issue Type: Bug
          Components: Java
    Affects Versions: 1.5.0
            Reporter: Shawn Hooton
            Assignee: Shawn Hooton
         Attachments: example-v1.json, example-v2.json

Using the command (sample payloads attached):
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v1.json

Produces the following output:
create table tbl (
  about string,
  address string,
  age tinyint,
  balance string,
  company string,
  email string,
  eyeColor string,
  favoriteFruit string,
  friends array <struct <
      id: tinyint,
      name: string>>,
  gender string,
  greeting string,
  guid string,
  id binary,
  index tinyint,
  isActive boolean,
  latitude decimal(8,6),
  longitude decimal(8,6),
  name string,
  phone string,
  picture string,
  registered timestamp,
  tags array <string>
)

Notice that because org/apache/orc/tools/json/StructType.java uses a java.util.TreeMap for
the fields instance variable, the generated DDL is sorted alphabetically by field name
instead of following the field order of the JSON document.  The convert command has the
same problem, as the metadata of a converted file shows:
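
The sorting can be reproduced in isolation. The sketch below is not the orc-tools code:
the class TreeMapOrderDemo and the method discoveredOrder are hypothetical, and the
document order passed in main is an assumption, since the attachments are not shown here.
It only illustrates that a java.util.TreeMap iterates its keys in sorted order regardless
of the order in which they were inserted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TreeMapOrderDemo {
    // Hypothetical stand-in for schema discovery: stores each field name in a
    // TreeMap and returns the names in the map's iteration order.
    public static List<String> discoveredOrder(String... fieldsInDocumentOrder) {
        Map<String, String> fields = new TreeMap<>();
        for (String name : fieldsInDocumentOrder) {
            fields.put(name, "string"); // the value type is irrelevant to the ordering
        }
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        // Assumed document order; the real example-v1.json may differ.
        System.out.println(discoveredOrder("id", "index", "guid", "isActive"));
    }
}
```

With that assumed input the names come back as guid, id, index, isActive, which matches
the relative order of those columns in the DDL above.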

java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc

*** output omitted for brevity

  "schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
  "schema": [
    {
      "columnId": 0,
      "columnType": "STRUCT",
      "childColumnNames": [
        "about",
        "address",
        "age",
        "balance",
        "company",
        "email",
        "eyeColor",
        "favoriteFruit",
        "friends",
        "gender",
        "greeting",
        "guid",
        "id",
        "index",
        "isActive",
        "latitude",
        "longitude",
        "name",
        "phone",
        "picture",
        "registered",
        "tags"
      ],
*** output omitted for brevity

This causes *major* problems when a field is added to the JSON document later, because
the new field is sorted into the middle of the column list, e.g.:
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v2.json

Compare where newField is added in the example-v2.json document with where it appears in
the output below (marked with *****): it is sorted in between name and phone, which shifts
every column after it.  This also affects the convert command.

create table tbl (
  about string,
  address string,
  age tinyint,
  balance string,
  company string,
  email string,
  eyeColor string,
  favoriteFruit string,
  friends array <struct <
      id: tinyint,
      name: string>>,
  gender string,
  greeting string,
  guid string,
  id binary,
  index tinyint,
  isActive boolean,
  latitude decimal(8,6),
  longitude decimal(8,6),
  name string,
*****  newField string,
  phone string,
  picture string,
  registered timestamp,
  tags array <string>
)

The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for
the fields instance variable so order is maintained across changes to the JSON schema.
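
A minimal sketch of the behaviour being asked for, assuming fields are recorded in the
order they are first seen. The class InsertionOrderDemo and its methods are hypothetical
and are not the StructType API, and the field orders are made up for illustration. A
java.util.LinkedHashMap keeps insertion order, and re-inserting an existing key does not
move it, so a field first seen in a later document ends up after the existing ones.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InsertionOrderDemo {
    // LinkedHashMap iterates keys in the order they were first inserted.
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Hypothetical helper: records the fields of one JSON document.
    public void addDocument(String... fieldsInDocumentOrder) {
        for (String name : fieldsInDocumentOrder) {
            fields.putIfAbsent(name, "string"); // known fields keep their position
        }
    }

    public List<String> fieldNames() {
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        InsertionOrderDemo schema = new InsertionOrderDemo();
        schema.addDocument("id", "index", "guid");             // assumed v1 order
        schema.addDocument("id", "newField", "index", "guid"); // assumed v2 order
        System.out.println(schema.fieldNames());
    }
}
```

In this sketch the existing columns stay where they were and newField is placed last.
Whether the actual fix handles multiple documents this way is not stated in this report.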

Pull request *with* test cases incoming :)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
