avro-user mailing list archives

From Wai Yip Tung ...@tungwaiyip.info>
Subject Re: Hadoop stream gzipped file with AvroAsTextInputFormat
Date Fri, 25 Jul 2014 00:22:03 GMT
I think I've figured out how to make this work.

Initially I had a file "data.avro". I gzipped it as "data.avro.gz" and
tried to feed it to Hadoop. This did not work.

Avro, however, supports the "deflate" codec natively. So I transcoded
the file into "data_deflate.avro" and fed it to Hadoop, and it works
correctly. The file size is slightly larger than if I gzip it as a whole.
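
A quick way to tell the two kinds of file apart is to look at the
leading bytes. An Avro container file starts with the four bytes "Obj"
plus 0x01, while a gzip stream starts with 0x1f 0x8b, so gzipping the
whole file hides the Avro header. A file recoded with the deflate codec
compresses the data blocks inside the container, so it still starts with
the Avro header. The sketch below only illustrates that difference; the
helper names sniff and sniff_file are made up for this example, the
sample bytes are a placeholder rather than a real Avro file, and I have
not checked how AvroAsTextInputFormat itself reads its input.

```python
import gzip

AVRO_MAGIC = b"Obj\x01"   # first four bytes of an Avro object container file
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def sniff(head: bytes) -> str:
    """Classify data by its leading bytes (hypothetical helper)."""
    if head.startswith(AVRO_MAGIC):
        return "avro"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Apply sniff() to the first four bytes of a file on disk."""
    with open(path, "rb") as f:
        return sniff(f.read(4))


# Placeholder bytes standing in for data.avro; not a valid Avro file,
# only the header bytes are real.
fake_avro = AVRO_MAGIC + b"placeholder for the rest of the container"

# Gzipping the whole file, as with data.avro.gz, replaces the leading bytes.
wrapped = gzip.compress(fake_avro)

print(sniff(fake_avro))                 # avro
print(sniff(wrapped))                   # gzip
print(sniff(gzip.decompress(wrapped)))  # avro
```

Running sniff_file on "data.avro.gz" and on "data_deflate.avro" should
report "gzip" for the first and "avro" for the second.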

I used avro-tools to do the transcoding. Its command line handling is
irregular, and it took me a lot of trial and error to get it to work.
The command that works for me is

   java -jar avro-tools-1.7.6.jar recodec --codec=deflate input.avro 

Wai Yip

> wy@tungwaiyip.info <mailto:wy@tungwaiyip.info>
> Wednesday, July 23, 2014 5:07 PM
> I have successfully streamed an Avro data file to Python mrjobs using
> the library AvroAsTextInputFormat
> -inputformat org.apache.avro.mapred.AvroAsTextInputFormat
> However, unlike with text files, it does not seem to handle gzipped
> files automatically. What can I do to stream a gzipped Avro file?
> Wai Yip
