avro-dev mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Questions re integrating Avro into Cascading process
Date Fri, 16 Apr 2010 18:28:16 GMT

On Apr 16, 2010, at 11:20 AM, Ken Krugler wrote:

> Hi Scott,
> 
> Thanks for the response. See below for my comments...
> 
>> 
>> Correct me if I'm wrong, but its notion of a record is very simple  
>> -- there are no arrays or maps -- just a list of fields.
>> This maps to Avro easily.
> 
> Correct - currently Cascading doesn't have built-in support for  
> arrays, maps or unions - though I believe arrays & maps are on the list.
> 

It would be great if Cascading, Pig, and Hive (along with Avro) could get to some good common
ground on all of these data types.


>> Creating an Avro schema programmatically is fairly straightforward  
>> -- especially without arrays, maps, or unions.  If the code has  
>> access to the Cascading record definition, transforming that into an  
>> Avro schema dynamically should be straightforward. Schema has  
>> various constructors and static methods from which you can get the  
>> JSON schema representation or just pass around Schema objects.
> 
> We're currently using the string rep, since a Schema isn't  
> serializable, and Cascading needs that to save the defined workflow in  
> the job conf.
> 

That should work well.  The JSON string representation is the canonical, cross-language serialization
of an Avro schema.
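A minimal sketch of that round trip (the record name and fields here are made up for illustration; this uses the Schema.Parser API from more recent Avro releases -- older versions exposed a static Schema.parse(String) instead):

```java
import org.apache.avro.Schema;

public class SchemaRoundTrip {
    public static void main(String[] args) {
        // A record schema in its canonical JSON string form, as you would
        // stash it in a JobConf.
        String json = "{\"type\":\"record\",\"name\":\"Line\",\"fields\":["
                + "{\"name\":\"url\",\"type\":\"string\"},"
                + "{\"name\":\"count\",\"type\":\"long\"}]}";
        Schema schema = new Schema.Parser().parse(json);

        // toString() emits the JSON form again, so the Schema object can be
        // rebuilt on the other side of serialization.
        String restored = schema.toString();
        Schema reparsed = new Schema.Parser().parse(restored);

        System.out.println(schema.equals(reparsed)); // prints "true"
    }
}
```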

> 
> So far one issue is that we need to translate between Cascading  
> Strings and Avro Utf8 types, but most everything else works just fine.
> 

Let us know about the difficulties here and any suggestions or requests for enhancement. 

I am interested in making the String <> Utf8 situation more efficient and easier to
use.
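For anyone following along, the translation in question is a straightforward copy in each direction -- a sketch (class name is illustrative):

```java
import org.apache.avro.util.Utf8;

public class Utf8Demo {
    public static void main(String[] args) {
        // Avro's mapred layer hands back Utf8 objects for string fields,
        // while Cascading tuples carry java.lang.String.
        Utf8 fromAvro = new Utf8("bixolabs");
        String forCascading = fromAvro.toString(); // Utf8 -> String copy

        // Going the other way is just the Utf8 constructor.
        Utf8 backToAvro = new Utf8(forCascading);

        System.out.println(forCascading.equals(backToAvro.toString()));
    }
}
```

The copy on every record is the efficiency cost being discussed above.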


>> One can go farther and use AvroWrapper and o.a.avro.mapred to define  
>> the M/R jobs, enabling a lot of other possibilities.  I can't  
>> confidently state what all the requirements are here outside of  
>> doing the Cascading record <> Avro schema translation and changing  
>> all the touch points that Cascading has on the K/V types.
> 
> It's pretty much four routines in the scheme:
> 
> - sinkInit (setting up the conf properly, for which we're using the  
> AvroJob support)
> - sourceInit (same thing)
> 
> - sink (mapping from Tuple to o.a.avro.generic.GenericData)
> - source (mapping from o.a.avro.generic.GenericData to Tuple)
> 
> The above is all based on the Avro mapred support, so we just have to  
> do the translation work for Fields <-> Schema and Tuple <-> GenericData.
> 
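That Fields <-> Schema and Tuple <-> GenericData translation could be sketched roughly like this (a hypothetical helper, assuming every field is a string -- no arrays, maps, or unions -- and the current Schema field API):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

public class FieldsToSchema {
    // Hypothetical helper: build a record schema from Cascading-style
    // field names, treating every field as an Avro string.
    static Schema schemaFor(String recordName, String... fieldNames) {
        List<Schema.Field> fields = new ArrayList<>();
        for (String name : fieldNames) {
            fields.add(new Schema.Field(name,
                    Schema.create(Schema.Type.STRING), null, null));
        }
        Schema record = Schema.createRecord(recordName, null, null, false);
        record.setFields(fields);
        return record;
    }

    public static void main(String[] args) {
        Schema schema = schemaFor("Page", "url", "content");

        // sink(): copy tuple values into a GenericData.Record by field name.
        GenericData.Record rec = new GenericData.Record(schema);
        rec.put("url", "http://bixolabs.com");
        rec.put("content", "hello");

        // source(): read them back out to rebuild the tuple.
        System.out.println(rec.get("url")); // prints "http://bixolabs.com"
    }
}
```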
> It looks pretty doable, thanks for the help!
> 
> -- Ken
> 
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g

