spark-user mailing list archives

From Ewan Leith <>
Subject RE: Should I convert json into parquet?
Date Mon, 19 Oct 2015 09:31:24 GMT
As Jörn says, Parquet and ORC will get you really good compression and can be much faster.
There are also some nice additions around predicate pushdown, which can be great if you've got
wide tables.
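
For example, the conversion and a pushdown-friendly read might look like this (a minimal
sketch, assuming Spark 1.5's DataFrame API, an existing SQLContext named sqlContext, and a
hypothetical timestamp column in the events; paths are placeholders):

// Read the existing JSON files into a DataFrame.
val events = sqlContext.read.json("hdfs:///data/events")

// Rewrite them as Parquet.
events.write.parquet("hdfs:///data/events_parquet")

// On the way back in, a filter like this can be pushed down into the
// Parquet reader, so row groups that can't match are skipped rather
// than decompressed and scanned.
val recent = sqlContext.read
  .parquet("hdfs:///data/events_parquet")
  .filter("timestamp >= '2015-10-01'")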

Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is described
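
A minimal sketch of writing ORC, assuming Spark 1.5 where ORC support lives in the Hive
module, so you need a HiveContext built on the existing SparkContext sc:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Read the JSON events and rewrite them as ORC.
val events = hiveContext.read.json("hdfs:///data/events")
events.write.format("orc").save("hdfs:///data/events_orc")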


-----Original Message-----
From: Jörn Franke [] 
Sent: 19 October 2015 06:32
To: Gavin Yue <>
Cc: user <>
Subject: Re: Should I convert json into parquet?

Good formats are Parquet or ORC. Both can be useful with compression, such as Snappy. They
are much faster than JSON. However, the table structure is up to you and depends on your use
case.
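
For example, picking Snappy for Parquet output is a single config setting (a minimal sketch,
assuming an existing SQLContext named sqlContext; paths are placeholders):

// Choose the Parquet compression codec before writing.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

val events = sqlContext.read.json("hdfs:///data/events")
events.write.parquet("hdfs:///data/events_parquet")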

> On 17 Oct 2015, at 23:07, Gavin Yue <> wrote:
> I have JSON files which contain timestamped events. Each event is associated with a user.
> Now I want to group by user id. So I want to convert from
> Event1 -> UserIDA;
> Event2 -> UserIDA;
> Event3 -> UserIDB;
> To intermediate storage. 
> UserIDA -> (Event1, Event2...)
> UserIDB -> (Event3...)
> Then I will label positives and featurize the event vectors in many different ways, and
> fit each of them into a logistic regression.
> I want to save this intermediate storage permanently, since it will be used many times.
> And there will be new events coming every day, so I need to update this intermediate
> storage every day.
> Right now I store the intermediate data as JSON files. Should I use Parquet instead?
> Or are there better solutions for this use case?
> Thanks a lot!
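
A minimal sketch of the grouping step described in the question above, assuming Spark 1.5, an
existing SQLContext named sqlContext, and a hypothetical schema where each record carries a
userId field and the event payload is kept as a string; paths are placeholders:

import sqlContext.implicits._

// One day's raw events as (userId, event) pairs.
val pairs = sqlContext.read.json("hdfs:///data/events/2015-10-17")
  .selectExpr("userId", "event")
  .map(row => (row.getString(0), row.getString(1)))

// Collect each user's events together, then persist the result as
// Parquet so the featurization passes can reread it cheaply.
val byUser = pairs
  .groupByKey()
  .map { case (user, evs) => (user, evs.toSeq) }
  .toDF("userId", "events")

byUser.write.mode("append").parquet("hdfs:///data/events_by_user")

Appending just adds each day's groups as new rows, so a user can appear once per day; if you
need a single row per user you would periodically re-group the whole store, which Parquet
makes much cheaper to re-read than JSON.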
