spark-user mailing list archives

From Xiao Li <>
Subject Re: Best storage format for intermediate process
Date Fri, 09 Oct 2015 23:16:30 GMT
Hi, Saif,

This depends on your use case. For example, do you need a full table scan
every time, a lookup of specific rows, or temporal queries? Do you have
security requirements that constrain your choice of target-side data
store?

Offloading a huge table is also very expensive and time-consuming. If
the source side is a mainframe, it can also consume a lot of MIPS. Thus, the
best way is to save the data to persistent storage once, without any
transformation, and then transform and store it according to your query types.
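
As a rough illustration of that two-step approach, here is a minimal sketch,
not a definitive implementation. It assumes PySpark, where `spark` is any
object exposing `.read` (a SQLContext or SparkSession), and it uses Parquet as
the staging format purely as an example choice. The function names, paths,
table, and column names are all hypothetical.

```python
def jdbc_read_options(url, table, partition_column, lower, upper, num_partitions):
    """Build options for Spark's JDBC data source.

    partitionColumn/lowerBound/upperBound/numPartitions ask Spark to split
    the pull into parallel range reads instead of one long single query.
    """
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def stage_raw(spark, options, raw_path):
    """Offload the source table once and persist it untransformed."""
    df = spark.read.format("jdbc").options(**options).load()
    # Parquet is an assumption here; it keeps column types and compresses.
    df.write.mode("overwrite").parquet(raw_path)


def build_query_copy(spark, raw_path, out_path, columns, partition_by):
    """Derive a query-specific copy from the staged data, not from the source."""
    raw = spark.read.parquet(raw_path)
    (raw.select(*columns)
        .write.mode("overwrite")
        .partitionBy(partition_by)
        .parquet(out_path))
```

The expensive JDBC pull then happens only in `stage_raw`; each later layout
(for scans, row lookups, or time-based queries) is rebuilt from the staged
copy with something like `build_query_copy`, so the source system is not hit
again.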


Xiao Li

2015-10-09 11:25 GMT-07:00 <>:

> Hi all,
> I am in the process of learning big data.
> Right now, I am bringing huge databases through JDBC to Spark (a 250
> million rows table can take around 3 hours), and then re-saving it into
> JSON, which is fast, simple, distributed, fail-safe and stores data types,
> although without any compression.
> Reading this amount of data from distributed JSON takes around 2-3
> minutes and works well enough for me. But do you suggest or prefer any
> other format for intermediate storage, for fast reading with proper types?
> Not only as an intermediate step after a network database, but also for
> intermediate DataFrame transformations, to have data ready for processing.
> I have tried CSV, but the computed type inference usually does not fit my
> needs and takes a long time. Haven’t tried Parquet since they fixed it for
> 1.5, but that is also another option.
> What do you think of HBase, Hive, or any other option?
> Looking for insights!
> Saif
