avro-user mailing list archives

From: "Joey Echeverria" <j...@cloudera.com>
Subject: Re: Why Avro file format is larger than CSV?
Date: Fri, 19 Sep 2014 14:21:10 GMT
What is the schema for the data?

If every field is a string, then you could end up in this situation: Avro's binary encoding
stores a string as a varint length prefix followed by its UTF-8 bytes, so an all-string
record is roughly the same size as its CSV line. Your best bet is to use compression for
the Avro data.
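
For a standalone writer, it's just a matter of setting a codec on the DataFileWriter
before creating the file. Here's a minimal sketch (the CompressedAvroWrite class, the
Row schema, the col1/col2 field names, and rows.avro are all made up for illustration;
in a MapReduce job you'd configure the same codec on the Avro output format instead):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class CompressedAvroWrite {
  public static void main(String[] args) throws IOException {
    // Hypothetical all-string schema -- the worst case for uncompressed
    // Avro, since each string is a length prefix plus its UTF-8 bytes.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
        + "{\"name\":\"col1\",\"type\":\"string\"},"
        + "{\"name\":\"col2\",\"type\":\"string\"}]}");

    GenericRecord record = new GenericData.Record(schema);
    record.put("col1", "some value");
    record.put("col2", "another value");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
    // Compress each data block. Deflate ships with Avro;
    // CodecFactory.snappyCodec() also works if snappy is on the classpath.
    writer.setCodec(CodecFactory.deflateCodec(6));
    writer.create(schema, new File("rows.avro"));
    writer.append(record);
    writer.close();
  }
}

Deflate level 6 is a reasonable speed/size trade-off; with repetitive string data the
compressed file should typically come in well under the CSV original.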

If you have a lot of CSV files that you want to convert to compressed Avro, there are some
command line tools in the Kite SDK[1] that might help. 

Check out this example:

http://kitesdk.org/docs/current/guide/Using-the-Kite-CLI-to-Create-a-Dataset/
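
If I remember the CLI right, the flow on that page boils down to three commands (the
file names and the dataset name below are just placeholders; check the linked page for
the exact invocation):

# infer an Avro schema from the CSV header
kite-dataset csv-schema ratings.csv --class Rating -o rating.avsc

# create a dataset backed by that schema
kite-dataset create ratings --schema rating.avsc

# load the CSV; the records are written out as Avro data files
kite-dataset csv-import ratings.csv ratings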

-Joey

[1] http://kitesdk.org/docs/current/
--
Joey Echeverria

On Fri, Sep 19, 2014 at 3:31 AM, diplomatic Guru <diplomaticguru@gmail.com>
wrote:

> I've been experimenting with a MapReduce job using CSV and Avro formats. What
> I find strange is that the Avro output is larger than the CSV.
> For example, I exported some data in CSV, which is about 1.6GB. I then wrote a
> schema and a MapReduce job to take that CSV, serialize it, and write the
> output back to HDFS.
> When I checked the file size of the output, it was 2.4GB. I assumed that the
> size would be smaller because it converts the data into binary, but I was
> wrong. Do you know what the reason is? Could you refer me to some
> documentation on this?
> I've checked the .avro file and I could see that the header contains the
> schema and the rest are data blocks.