avro-dev mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Avro files and Pig/Hive
Date Thu, 25 Mar 2010 17:30:23 GMT
I have been doing some research into figuring out how to read (and later write) Avro container
files in Pig and Hive.

This has brought up some interesting challenges.  Below are some of my thoughts on the situation
so far.  I'm sure some Avro JIRA tickets will result eventually. 

From my preliminary work, mapping Pig to Avro should be relatively easy, since the main
data types map to each other fairly cleanly.  Both have maps and arrays/bags, for example,
and in both the maps require string keys.
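As a rough sketch of how the types line up, here is one plausible Avro-to-Pig correspondence (the mapping table itself is my assumption of a reasonable convention, not an established one; unions and booleans need special handling since Pig has neither):

```python
# Hypothetical Avro -> Pig type correspondence (a sketch, not a spec).
AVRO_TO_PIG = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "chararray",
    "bytes": "bytearray",
    "fixed": "bytearray",
    "record": "tuple",
    "array": "bag",       # a bag of single-field tuples
    "map": "map",         # both sides require string keys
    "enum": "chararray",  # store the symbol name
}

def pig_type(avro_type):
    """Look up the Pig type for a simple Avro type name."""
    try:
        return AVRO_TO_PIG[avro_type]
    except KeyError:
        raise ValueError("no direct Pig mapping for %r" % avro_type)
```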
Making an arbitrary reader/writer will be a bit more of a challenge, but the API in Pig 0.7
should be better (http://issues.apache.org/jira/browse/PIG-966, http://wiki.apache.org/pig/LoadStoreRedesignProposal).
I wish I had time to make sure their new proposal can handle Avro files as cleanly
and efficiently as possible before it gets into an official release.

Pig may require a lot of 'hidden' unions with null in its schemas if it is used to write
generically, since any Pig field can be null.  The use case best matches the Generic API
for now, but something else may be better down the road.

The Hive type system maps to Avro almost completely.  Hive supports arrays, maps, and
structs.  Hive maps, however, can have any primitive type as a key (int, long, string,
float, double), whereas Avro map keys must be strings.  Other than that, arrays are arrays
and structs are records.  Avro files should perform better and be more compact than
sequence files.

** Unions are a challenge
Unions are a challenge in both.  Currently I am using Pig with a custom LoadFunc: for each
union, I generate a field for each non-null branch plus a field that records which branch
is used.  This is ... not a good long-term solution.  For example, the union field
{"name":"myField", "type": ["string", "bytes"]} would generate three Pig fields:
myFieldString, myFieldBytes, and a branch-indicator field.
In Hive, that hack could work and be equally ugly, or possibly a "table family" could be
created for certain union types, with one table per branch.  In other cases a custom
operation is needed.
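The flattening hack can be sketched roughly as follows (Python for brevity; the field-naming convention, including the "Branch" tag suffix, is mine, and a real LoadFunc would also have to populate the fields from actual data):

```python
def flatten_union(field_name, branches):
    """Expand a union field into one Pig field per non-null branch,
    plus a tag field recording which branch was set.

    Naming convention is hypothetical: branch type name is title-cased
    and appended to the field name; the tag field gets a 'Branch' suffix.
    """
    fields = [field_name + b.capitalize() for b in branches if b != "null"]
    fields.append(field_name + "Branch")  # which branch is populated
    return fields
```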
Example 1, a small 'leaf' union:  I have a field that is a union of a string and a fixed
byte[16].  In my custom Pig script I convert the bytes to hex and always use a string,
generating one field.  I could also create one field as a variable-length byte[] type and
use the UTF-8 bytes of the string.  In my case the string is always more than 32 characters,
so there are no collisions between the branches either way.  These custom field mappings
cannot be done with a generic "read any Avro file in Pig/Hive" class.
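The string/fixed-16 collapse above might look like this sketch (the real logic lives in a Pig script/UDF; this just shows why the branches cannot collide -- a fixed byte[16] always hexes to exactly 32 characters, while the real strings are longer):

```python
def normalize(value):
    """Collapse a union of string | fixed byte[16] into one string field:
    fixed bytes become 32 hex characters, strings pass through unchanged."""
    if isinstance(value, (bytes, bytearray)):
        return value.hex()  # 16 bytes -> exactly 32 hex chars
    return value
```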

Example 2, a large 'branch' union:  Some unions are unions of many larger, more complicated
records.  In Pig this can map to a SPLIT (several record streams from one source), or in
Hive to a 'table family', but neither can currently be done naturally or automatically -- a
fully custom reader/writer is necessary for each schema that contains such a 'branch union'.

Getting some sort of union-type feature added to both would be beneficial, even if it is
restricted in scope and only covers a few of the more common use cases.

** Avro enhancements
Both the Specific and Generic APIs add extra object overhead here.  For example, in Pig one
creates the Avro object, then reads its fields and copies them into a Pig Tuple.  Lower-level
readers are better -- ideally the Pig reader gets callbacks for each field it is interested
in, in the order it expects (reader schema order), and fills out its own object.  I think
some of our Decoders can operate that way.  A Pig feature that makes it easier to construct
tuples out of order (writer schema order) would be useful too.
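The callback idea could look something like this sketch (the names and shapes are mine, not an existing Avro or Pig API; `decoded_fields` stands in for values pulled off a Decoder):

```python
class FieldCallbackReader:
    """Hypothetical low-level reader: instead of materializing a full
    Avro record object, invoke one callback per projected field, in
    reader-schema order, so the consumer fills out its own tuple."""

    def __init__(self):
        self.callbacks = {}  # field name -> handler

    def on_field(self, name, handler):
        self.callbacks[name] = handler

    def read_record(self, decoded_fields, reader_order):
        # Walk fields in the order the reader schema expects; skip
        # fields no one registered interest in.
        for name in reader_order:
            if name in self.callbacks:
                self.callbacks[name](decoded_fields[name])
```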

Hive has a lot of projection features that could be served well by slightly different file
formats (for example, the ability to skip variable-length fields faster -- perhaps via a
per-record map of field sizes).
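Such a size map might be encoded like this (the wire format here is entirely hypothetical -- fixed-width size prefixes per record -- just to show how a reader could jump straight to a projected field):

```python
import struct

def encode_record(fields):
    """Prefix each record with a map of field sizes (one little-endian
    uint32 per field), so a reader can skip unwanted variable-length
    fields without parsing them."""
    sizes = [len(f) for f in fields]
    header = struct.pack("<%dI" % len(sizes), *sizes)
    return header + b"".join(fields)

def read_field(buf, num_fields, index):
    """Read only field `index`, skipping earlier fields by size."""
    sizes = struct.unpack_from("<%dI" % num_fields, buf, 0)
    offset = 4 * num_fields + sum(sizes[:index])
    return buf[offset:offset + sizes[index]]
```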

Neither will support recursive schemas.  Is there a quick way to check whether a schema is
recursive?  In general, some features in Avro that make it easier to 'categorize' a schema
would be beneficial.
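On the quick-check question: one way, sketched here over the parsed schema JSON rather than the org.apache.avro.Schema API, is a depth-first walk that flags any named type referenced while it is still on the traversal path:

```python
def is_recursive(schema, path=frozenset()):
    """Return True if an Avro schema (as parsed JSON) references a named
    type that encloses the reference, i.e. the schema is recursive.
    Simplified sketch: handles records, unions, arrays, and maps; ignores
    namespaces and aliases."""
    if isinstance(schema, str):
        return schema in path          # name reference to enclosing type?
    if isinstance(schema, list):       # a union is a JSON array
        return any(is_recursive(b, path) for b in schema)
    if isinstance(schema, dict):
        t = schema.get("type")
        if t == "record":
            inner = path | {schema["name"]}
            return any(is_recursive(f["type"], inner)
                       for f in schema.get("fields", []))
        if t == "array":
            return is_recursive(schema["items"], path)
        if t == "map":
            return is_recursive(schema["values"], path)
        return is_recursive(t, path)   # e.g. nested inline type
    return False
```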