avro-user mailing list archives

From Han JU <ju.han.fe...@gmail.com>
Subject Python avro performance
Date Fri, 09 Jan 2015 13:32:42 GMT
Hi,

I'm evaluating Avro to replace our CSV-based datasets, and I've noticed a
performance problem in the Avro Python bindings.
I tested on a 1.8GB dataset with 5 columns. With Scala (Avro Java
bindings), reads and writes are fast (18s, 44s), but in Python, for the
same file, it took nearly one hour to write and 50 minutes to read ...
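
For illustration, a read/write timing of this kind could be taken with a
harness along the lines of the sketch below. This is not the exact code
behind the numbers above: the five-column row layout and the helper names
(timed, make_rows, write_csv, read_csv, benchmark) are hypothetical, and it
uses the standard-library csv module as the baseline so that it runs without
extra packages. An Avro variant would presumably swap write_csv and read_csv
for the writer and reader from the Avro documentation examples.

```python
import csv
import os
import tempfile
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def make_rows(n):
    """Generate n synthetic 5-column rows (hypothetical column layout)."""
    return [[i, "name%d" % i, i * 0.5, i % 7, "x"] for i in range(n)]


def write_csv(path, rows):
    """Write all rows to path and return the number of rows written."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return len(rows)


def read_csv(path):
    """Read every row back from path and return the row count."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f))


def benchmark(n_rows=100000):
    """Time one full write and one full read of n_rows synthetic rows."""
    rows = make_rows(n_rows)
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        written, write_s = timed(write_csv, path, rows)
        read, read_s = timed(read_csv, path)
    finally:
        os.remove(path)  # clean up the temporary file either way
    return {"written": written, "read": read,
            "write_s": write_s, "read_s": read_s}


if __name__ == "__main__":
    print(benchmark())
```

Timing the write and the read separately, on the same rows, keeps the two
numbers comparable across bindings.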

My code is based on the Avro documentation examples, and the schema is
relatively simple. My questions:
  - Is this performance difference a known issue?
  - Is there something I'm missing (say, a special configuration)?

I've seen the fastavro project, which is much faster at reading but has no
write support. This would prevent us from using Avro, since we have a lot of
Python-based programs that need to persist data.

Thanks!
-- 
*JU Han*

Data Engineer @ Botify.com

+33 0619608888
