spark-user mailing list archives

From Xiao Li <gatorsm...@gmail.com>
Subject Re: Best storage format for intermediate process
Date Fri, 09 Oct 2015 23:16:30 GMT
Hi, Saif,

This depends on your use case. For example, do you need a full table scan
every time, a lookup of a specific row, or a temporal query? Do you have
security requirements that constrain your choice of target-side data
store?

Offloading a huge table is also very expensive and time-consuming. If the
source side is a mainframe, it can also consume a lot of MIPS. Thus, the
best approach is to save the data to persistent storage without any
transformation, and then transform and store it based on your query types.
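
As a minimal sketch of that idea, in plain Python rather than Spark: the
rows, field names, and function names below are hypothetical, and the two
derived layouts only stand in for "a specific row" and "a temporal query".
The raw data is landed once, untouched, and each query-specific layout is
then built from the landed copy instead of from the source system.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical rows standing in for data pulled once from the source database.
SOURCE_ROWS = [
    {"id": 1, "account": "A", "ts": "2015-10-01", "amount": 10.5},
    {"id": 2, "account": "B", "ts": "2015-10-02", "amount": 7.0},
    {"id": 3, "account": "A", "ts": "2015-10-03", "amount": 3.25},
]


def land_raw(rows, path):
    """Write rows untransformed, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_raw(path):
    """Read the landed copy back; the source system is not touched again."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def index_by_id(rows):
    """Layout for point lookups: one row per key."""
    return {row["id"]: row for row in rows}


def partition_by_day(rows):
    """Layout for temporal queries: rows grouped by date string."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["ts"]].append(row)
    return dict(parts)


with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "raw.jsonl"
    land_raw(SOURCE_ROWS, raw_path)   # expensive offload happens once
    raw = read_raw(raw_path)          # later work starts from the landed copy

by_id = index_by_id(raw)
by_day = partition_by_day(raw)
```

The point of the split is that adding a third query type later means
writing another function over the landed copy, not another offload.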

Thanks,

Xiao Li


2015-10-09 11:25 GMT-07:00 <Saif.A.Ellafi@wellsfargo.com>:

> Hi all,
>
> I am in the process of learning big data.
> Right now, I am bringing huge databases through JDBC to Spark (a
> 250-million-row table can take around 3 hours), and then re-saving them as
> JSON, which is fast, simple, distributed, fail-safe and stores data types,
> although without any compression.
>
> Reading this amount of data from distributed JSON takes around 2-3
> minutes and works well enough for me. But do you suggest or prefer any
> other format for intermediate storage, for fast reads with proper types?
> Not only as an intermediate step after a network database, but also for
> intermediate dataframe transformations, to have data ready for processing.
>
> I have tried CSV, but type inference does not usually fit my needs and
> takes a long time. I haven't tried Parquet since they fixed it for 1.5,
> but that is also another option.
> What do you think of HBase, Hive, or any other option?
>
> Looking for insights!
> Saif
>
>
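
On the CSV versus JSON point in the quoted message, a small illustration in
plain Python (not Spark; the sample row is made up) of why a CSV reader has
to infer types while JSON carries at least the basic ones. Note that JSON
only preserves its own native types: dates or decimals would still need a
schema.

```python
import csv
import io
import json

# Hypothetical sample row with mixed types.
row = {"id": 42, "amount": 3.5, "active": True, "name": "x"}

# CSV round trip: every value is written as text and read back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON round trip: numbers, booleans and strings keep their types.
json_row = json.loads(json.dumps(row))

csv_types = {k: type(v).__name__ for k, v in csv_row.items()}
json_types = {k: type(v).__name__ for k, v in json_row.items()}

print(csv_types)
print(json_types)
```

Any reader of the CSV copy has to guess that "42" is an integer and "True"
is a boolean, which is the inference step that costs time and can guess
wrong.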
