spark-user mailing list archives

From <>
Subject Best storage format for intermediate process
Date Fri, 09 Oct 2015 18:25:47 GMT
Hi all,

I am in the process of learning big data.
Right now I am pulling huge database tables into Spark over JDBC (a 250-million-row table can
take around 3 hours), and then re-saving them as JSON, which is fast, simple, distributed,
fail-safe, and preserves data types, although without any compression.
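For the JDBC side, one thing that may cut the 3 hours down is partitioning the read so Spark opens several connections in parallel. A minimal sketch, assuming a numeric column to split on and placeholder connection details (the column name, URL, and bounds below are illustrative, not from your setup):

```scala
import org.apache.spark.sql.SQLContext

// Sketch: parallelize the JDBC read over a numeric column,
// then persist the result as line-delimited JSON.
// "id", the URL, and the bounds are placeholders.
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost/mydb")
  .option("dbtable", "big_table")
  .option("partitionColumn", "id")   // numeric column to split on
  .option("lowerBound", "1")
  .option("upperBound", "250000000")
  .option("numPartitions", "64")     // 64 concurrent JDBC reads
  .load()

df.write.json("hdfs:///staging/big_table_json")
```

Without partitionColumn/lowerBound/upperBound/numPartitions the whole table comes through a single connection, which is usually where the hours go.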

Reading this amount of data back from distributed JSON takes around 2-3 minutes and works
well enough for me. But do you suggest or prefer any other format for intermediate storage,
with fast reads and proper type handling?
Not only for staging data coming from a network database, but also for intermediate DataFrame
transformations, to have data ready for processing.
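If JSON keeps working for you, one way to get fast reads with exact types is to pass an explicit schema so Spark skips the inference pass entirely. A sketch, assuming you capture the schema once from the original DataFrame:

```scala
import org.apache.spark.sql.types.StructType

// Sketch: reuse the schema of the already-loaded DataFrame so a
// later JSON read skips schema inference and keeps exact types.
val schema: StructType = df.schema   // df from the original JDBC load
val typed = sqlContext.read.schema(schema).json("hdfs:///staging/big_table_json")
```

Schema inference on JSON otherwise requires a full scan of the data before the real read starts.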

I have tried CSV, but its type inference does not usually fit my needs and takes a long
time. I haven't tried Parquet since they fixed it for 1.5, but that is also another option.
What do you think of HBase, Hive, or any other store?
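For what it's worth, Parquet stores the schema in the file footer and is columnar and compressed, so the round trip is short and needs no inference; a sketch with placeholder paths and column names:

```scala
// Sketch: Parquet keeps the schema with the data, so the read
// needs no inference, and only selected columns are scanned.
df.write.parquet("hdfs:///staging/big_table_parquet")

val back = sqlContext.read.parquet("hdfs:///staging/big_table_parquet")
back.select("id", "amount")   // column pruning: only these columns are read
```

Since it is both typed and compressed, it addresses the two drawbacks you mention for JSON at once.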

Looking for insights!
