avro-user mailing list archives

From Scott Carey <scottca...@apache.org>
Subject Re: using Avro unions with HIVE
Date Thu, 23 May 2013 18:45:44 GMT
The Hive mailing list would have more info on the Avro SerDe usage.

In general, a system that does not have union types, like Hive (or Pig,
etc), has to expand a union into multiple fields if there is more than one
non-null type, with at most one of the expanded fields non-null in any
given record.

For example, a record with fields:

  {"name":"timestamp", "type":"long", "default":-1}
  {"name":"ipAddress", "type":["IPv4", "IPv6"]}

where IPv4 and IPv6 are previously defined types, would have to expand to
three fields: "timestamp", "ipAddress:IPv4", and "ipAddress:IPv6", where
only one of the last two is not null in any given record.
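
As a rough illustration of that expansion, here is a minimal Python sketch.
It works on the parsed JSON of a record's field list rather than on any Avro
or Hive API. The function names (branch_name, expand_fields) are made up for
this example, and the "field:Branch" naming simply follows the example
above; it is not a claim about what any particular SerDe produces.

```python
def branch_name(branch):
    """Label a union branch: a named type's name, otherwise its type."""
    if isinstance(branch, dict):
        return branch.get("name", branch["type"])
    return branch


def expand_fields(fields):
    """Flatten a list of Avro field definitions into column names.

    A union with more than one non-null branch becomes one column per
    branch ("field:Branch"); every other field keeps its own name.
    """
    columns = []
    for field in fields:
        ftype = field["type"]
        if isinstance(ftype, list):  # Avro unions are JSON arrays
            non_null = [b for b in ftype if b != "null"]
            if len(non_null) > 1:
                columns.extend(
                    f'{field["name"]}:{branch_name(b)}' for b in non_null
                )
                continue
        columns.append(field["name"])
    return columns


fields = [
    {"name": "timestamp", "type": "long", "default": -1},
    {"name": "ipAddress", "type": ["IPv4", "IPv6"]},
]
print(expand_fields(fields))
# ['timestamp', 'ipAddress:IPv4', 'ipAddress:IPv6']
```

Under the same naming assumption, the "body" field in the schema quoted
below would expand to "body:event1" and "body:event2", while a simple
nullable field such as "globalUserId" would stay a single column.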

I do not know what Hive's Avro SerDe does with unions.

On 5/23/13 7:15 AM, "Ran S" <rans@liveperson.com> wrote:

>Hi,
>We started to work with Avro in CDH4 and to query the Avro files using
>Hive.
>This does work fine for us, except for unions.
>We do not understand how to query the data inside a union using Hive.
>
>For example, let's look at the following schema:
>
>{
>	"type":"record", 
>	"name":"event", 
>	"namespace":"com.mysite",
>	"fields":[
>    {
>        "name":"header",
>        "type":{
>            "type":"record", "name":"CommonHeader",
>            "fields":[{ "name":"eventTimeStamp", "type":"long", "default":-1
>},
>                      { "name":"globalUserId", "type":["null", "string"],
>"default":null } ]
>        },
>        "default":null
>    },
>    {
>        "name":"eventbody",
>        "type":{
>            "type":"record", "name":"eventbody",
>            "fields":[
>                {
>                    "name":"body",
>                    "type":[
>                       "null",
>                       {
>                        "type":"record",
>                        "name":"event1",
>                        "fields":[
>                            {
>                                "name":"event1Header",
>                                "type":["null", { "type":"array",
>"items":"string" }], "default":null
>                            },
>                            {
>                                "name":"event1Body",
>                                "type":["null", { "type":"array",
>"items":"string" }], "default":null
>                            }
>                        ]
>                    },
>                   {
>                        "type":"record",
>                        "name":"event2",
>                        "fields":[
>                            {
>                                "name":"page",
>                                "type":{
>                                    "type":"record", "name":"URL",
>"fields":[{ "name":"url", "type":"string" }]
>                                },
>                                "default":null
>                            },
>                            {
>                                "name":"referrer", "type":"string",
>"default":null
>                            }
>                        ]
>                    }
>		],
>                    "default":null
>                }
>            ]
>        },
>        "default":null
>    }
>]}
>
>Note that "body" is a union of three types:
>null, "event1" and "event2"
>
>So if I want to query fields inside event1, I first need to access it.
>I then set a HiveQL like this:
>SELECT eventbody.body.??? from SRC
>
>My question is: what should I put in the ??? above to make this work?
>
>Thank you,
>Ran
>
>
>
>--
>View this message in context:
>http://apache-avro.679487.n3.nabble.com/using-Avro-unions-with-HIVE-tp4027
>473.html
>Sent from the Avro - Users mailing list archive at Nabble.com.


