avro-user mailing list archives

From David Ginzburg <dav...@inner-active.com>
Subject RE: Generating snappy compressed avro files as hadoop map reduce input files
Date Sun, 13 Oct 2013 19:23:31 GMT
I am not generating the Avro files with Hadoop MR, but with a different process.
I plan to just store the files on S3 for potential archive processing with EMR.
Can I use AvroSequenceFile from a non-M/R process to generate the sequence files, with my
Avro records as the values and null keys?
From: graham sanderson <graham@vast.com>
Sent: Sunday, October 13, 2013 9:16 PM
To: user@avro.apache.org
Subject: Re: Generating snappy compressed avro files as hadoop map reduce input files

If you're using Hadoop, why not use AvroSequenceFileOutputFormat? It works fine with Snappy
(block-level compression may be best, depending on your data).

On Oct 13, 2013, at 10:58 AM, David Ginzburg <davidg@inner-active.com>

As mentioned in http://stackoverflow.com/a/15821136, Hadoop's Snappy codec just doesn't work
with externally generated files.

Can files generated by DataFileWriter
(http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29)
serve as input files for a MapReduce job, especially EMR jobs?
From: Bertrand Dechoux <dechouxb@gmail.com>
Sent: Sunday, October 13, 2013 6:36 PM
To: user@avro.apache.org
Subject: Re: Generating snappy compressed avro files as hadoop map reduce input files

I am not sure I understand the relation between your problem and the way the temporary data
is stored after the map phase.

However, I guess you are looking for DataFileWriter and its setCodec method.
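
The setCodec suggestion above can be sketched roughly as follows. This is a minimal sketch, not code from the thread: the class name, the two-field Event schema and the helper methods are hypothetical, and it assumes the Avro Java library plus the snappy-java library are on the classpath. Note that setCodec has to be called before create.

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper class; only DataFileWriter, setCodec and CodecFactory come from the thread.
class SnappyAvroWriterSketch {
    // Illustrative schema, assumed for this example only.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"payload\",\"type\":\"string\"}]}");

    // Writes `count` generated records to `target` as an Avro container file with snappy blocks.
    static void write(File target, int count) throws IOException {
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            writer.setCodec(CodecFactory.snappyCodec()); // must precede create()
            writer.create(SCHEMA, target);
            for (int i = 0; i < count; i++) {
                GenericRecord record = new GenericData.Record(SCHEMA);
                record.put("id", (long) i);
                record.put("payload", "event-" + i);
                writer.append(record);
            }
        }
    }

    // Reads the codec name recorded in the file header metadata.
    static String codecOf(File avroFile) throws IOException {
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            return reader.getMetaString("avro.codec");
        }
    }

    // Counts the records by reading the file back.
    static int countRecords(File avroFile) throws IOException {
        int n = 0;
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            while (reader.hasNext()) {
                reader.next();
                n++;
            }
        }
        return n;
    }
}
```

The resulting file could then be uploaded to S3 by whatever means the application already uses; that step is outside this sketch.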



PS: A Snappy-compressed Avro file is not a standard file that has been compressed afterwards,
but a specific file format containing compressed blocks. This principle is similar to
SequenceFile's. Maybe that's what you mean by a different Snappy codec?
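
To illustrate the "file containing compressed blocks" idea in the PS, here is a toy sketch using only the JDK. It is not the Avro or SequenceFile layout: the magic bytes, the block framing and the class name are made up for illustration, and Deflate stands in for Snappy because the JDK does not ship a Snappy codec. The point is only that the header stays readable and each block is compressed on its own, rather than the whole file being compressed as one stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy block-compressed container; hypothetical format, NOT the Avro file format.
class BlockContainerSketch {
    static final byte[] MAGIC = {'T', 'O', 'Y', 1}; // made-up uncompressed header

    static byte[] write(List<String> records, int recordsPerBlock) throws IOException {
        if (recordsPerBlock <= 0) throw new IllegalArgumentException("recordsPerBlock must be > 0");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.write(MAGIC); // the header is written as-is, never compressed
        for (int start = 0; start < records.size(); start += recordsPerBlock) {
            int end = Math.min(start + recordsPerBlock, records.size());
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            DataOutputStream rawOut = new DataOutputStream(raw);
            for (String r : records.subList(start, end)) rawOut.writeUTF(r);
            byte[] rawBytes = raw.toByteArray();
            byte[] compressed = deflate(rawBytes); // each block is compressed independently
            out.writeInt(end - start);       // records in this block
            out.writeInt(rawBytes.length);   // uncompressed size
            out.writeInt(compressed.length); // compressed size
            out.write(compressed);
        }
        return bytes.toByteArray();
    }

    static List<String> read(byte[] container) throws IOException, DataFormatException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(container));
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IOException("bad header");
        List<String> records = new ArrayList<>();
        while (in.available() > 0) {
            int count = in.readInt();
            byte[] raw = new byte[in.readInt()];
            byte[] compressed = new byte[in.readInt()];
            in.readFully(compressed);
            inflate(compressed, raw);
            DataInputStream block = new DataInputStream(new ByteArrayInputStream(raw));
            for (int i = 0; i < count; i++) records.add(block.readUTF());
        }
        return records;
    }

    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    static void inflate(byte[] compressed, byte[] target) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        int off = 0;
        while (off < target.length && !inflater.finished()) {
            int n = inflater.inflate(target, off, target.length - off);
            if (n == 0 && inflater.needsInput()) throw new DataFormatException("truncated block");
            off += n;
        }
        inflater.end();
        if (off != target.length) throw new DataFormatException("unexpected block size");
    }
}
```

A generic decompressor pointed at such a file would fail on the uncompressed header, which is one way to picture why a block-compressed container is not the same thing as a file compressed as a whole afterwards.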

On Sun, Oct 13, 2013 at 5:16 PM, David Ginzburg <davidg@inner-active.com>

I am writing an application that produces Avro record files, to be stored on AWS S3 as possible
input to EMR.
I would like to compress them with the Snappy codec before storing them on S3.
It is my understanding that Hadoop currently uses a different Snappy codec, mostly used as
the intermediate map output format.
My question is how can I generate within my application logic (not MR) snappy compressed avro
