avro-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Avro and Hive
Date Mon, 01 Nov 2010 16:32:15 GMT

On Oct 28, 2010, at 11:43 AM, Ken Krugler wrote:

> Hi all,
> I'd seen past emails from Scott and Doug about using Avro as the data  
> format for Hive.
> This was back in April/May, and I'm wondering about current state of  
> the world.

I use Avro data files that are readable from both Pig and Hive.  Currently this works only with
our custom schemas and is not general-purpose read/write.  A key future benefit is easier sharing
of arbitrary data inputs/outputs between Hive, Pig, Java M/R, and anything else.  Note that the
Howl project is also attacking this problem.  It might make sense to contribute an Avro backend
and adapter for that project, so that an Avro data file can be read in as a table in Howl.
That would be a while off (~year?) at best if I am the contributor.

> Specifically, what's the recommended approach (& known issues) with  
> using Avro files with Hive?
> E.g. Scott mentioned that "Avro files should be better performing and  
> more compact than sequence files." Has that been proven out?

In general this will be the case.  SequenceFiles use Writables.  These often don't store data
in a binary format as compact as Avro's, and are often composed together to form more complicated
Writables.  Reading and writing each Writable tends to lead to more fine-grained access to
the stream, which is slower.
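
To illustrate the compactness point, here is a minimal sketch comparing the variable-length zig-zag encoding that the Avro specification describes for int/long values with a fixed-width 4-byte big-endian integer, which is the layout a plain IntWritable uses. The function names are made up for this example, and it covers only a single integer field, not a whole file format.

```python
import struct


def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer with zig-zag + base-128 varint coding,
    as described for int/long in the Avro specification."""
    # Zig-zag maps small magnitudes (positive or negative) to small unsigned values.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)  # final byte, continuation bit clear
    return bytes(out)


def fixed_int32(n: int) -> bytes:
    """Fixed-width big-endian 4-byte int (IntWritable-style layout)."""
    return struct.pack(">i", n)


if __name__ == "__main__":
    for value in (0, 1, -1, 64, 100000):
        print(value, len(zigzag_varint(value)), len(fixed_int32(value)))
```

Small values take one or two bytes in the variable-length form and always four in the fixed form. Once block compression is applied the gap usually narrows, which fits the point that size, not speed, is the more visible difference.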

However, one could make a specialized Writable for a specific data type or dataset that is
very fast and would outdo what a general-purpose tool like Avro does.  In general use, the
slightly smaller size of Avro is more likely to be noticed than any performance difference.
Compression overhead dwarfs minor performance differences.  The goal of serialization
performance, in my opinion, is to make your choice of compression the primary factor in
performance.

There is no equivalent of MapFile or a columnar storage format in Avro (yet).

> He also discussed a minor issue with maps - "Their maps however can  
> have any intrinsic type as a key (int, long, string, float, double)."

Represent an arbitrary map as an Avro array of (key, value) tuples.  This has no restrictions.
In some sense, that is all an Avro map is anyway: a special-case array of (key, value) with
a default map data structure in the language API.  Some schema metadata might be required
to give Hive the hint that it can treat an array as a map.
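
A minimal sketch of the array-of-tuples idea, using plain Python dicts to build the Avro schema JSON. The record name `IntKeyEntry` and the `x-treat-as-map` attribute are hypothetical: the Avro specification allows extra attributes as metadata, but nothing in Hive or Avro is known to read this particular one.

```python
import json


def map_as_array_schema(entry_name, key_type, value_type):
    """Build an Avro schema for an array of (key, value) records."""
    return {
        "type": "array",
        "items": {
            "type": "record",
            "name": entry_name,
            "fields": [
                {"name": "key", "type": key_type},
                {"name": "value", "type": value_type},
            ],
        },
        # Hypothetical metadata hint; not an attribute defined by Avro or Hive.
        "x-treat-as-map": True,
    }


def to_pairs(mapping):
    """Turn a dict with arbitrary key types into a list of key/value records."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def from_pairs(pairs):
    """Rebuild the dict from the list of key/value records."""
    return {p["key"]: p["value"] for p in pairs}


if __name__ == "__main__":
    schema = map_as_array_schema("IntKeyEntry", "int", "string")
    print(json.dumps(schema, indent=2))
    print(to_pairs({1: "one", 2: "two"}))
```

Because the key is an ordinary record field, it can be an int, long, float, or double, which a native Avro map (string keys only) does not allow.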

> And a more serious issue with unions, though this wouldn't directly  
> impact us as we wouldn't be using that feature.

I have dealt with unions by exposing all branches of the union to Hive and/or Pig.  Branches
that are not taken are null.  In some cases I expose an extra field for the branch taken.
This does not work in all cases; in particular, recursive schemas can be troublesome.
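
The union handling described above might be sketched as follows. This is only an illustration under stated assumptions, not the actual adapter code: the column names, the `branch_taken` field name, and the use of Python types to pick the branch are all made up for the example.

```python
def flatten_union(value, branches):
    """Expose one column per union branch; untaken branches stay None.

    branches: ordered list of (column_name, python_type) pairs (hypothetical layout).
    """
    row = {name: None for name, _ in branches}
    row["branch_taken"] = None  # extra field naming the branch that matched
    for name, py_type in branches:
        if isinstance(value, py_type):
            row[name] = value
            row["branch_taken"] = name
            break
    return row


if __name__ == "__main__":
    branches = [("as_long", int), ("as_string", str)]
    print(flatten_union("abc", branches))
    print(flatten_union(42, branches))
    print(flatten_union(None, branches))
```

A recursive schema has no fixed set of branches to flatten into columns, which is one way to see why this approach breaks down there.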

> In our situation, we're trying to get the best of both worlds by  
> leveraging Hive for analytics, and Cascading for workflow, so having  
> one store in HDFS for both would be a significant win.

Same here but we replaced Cascading with Pig.  We have exposed data to both, but need the
ability to create a table in Hive and read from Pig, and vice-versa.  Avro should be a great
tool for that, and leave the data open for many more things to access as well. 

I will be working on completing the Avro PigStorage late this year, and then have a look at
Hive early next year. 

> Thanks for any input!
> -- Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
