spark-user mailing list archives

From: Gavin Yue <yue.yuany...@gmail.com>
Subject: Should I convert json into parquet?
Date: Sat, 17 Oct 2015 21:07:45 GMT
I have JSON files which contain timestamped events. Each event is associated
with a user ID.

Now I want to group by user ID, converting from

Event1 -> UserIDA;
Event2 -> UserIDA;
Event3 -> UserIDB;

to intermediate storage:
UserIDA -> (Event1, Event2...)
UserIDB -> (Event3...)
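
A minimal sketch of that grouping step with the Spark DataFrame API might look
like the following (the input path and the userId / event column names are
just assumptions about my data):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().appName("GroupEventsByUser").getOrCreate()

// Read the raw timestamped events (path and column names are assumptions).
val events = spark.read.json("hdfs:///events/raw/")

// Collapse the per-event rows into one row per user:
// userId -> [event1, event2, ...]
val eventsByUser = events
  .groupBy("userId")
  .agg(collect_list("event").as("events"))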

Then I will label the positives and featurize the event vectors in many
different ways, fitting each variant into a logistic regression.
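
For the modeling step, a bare-bones sketch with spark.ml's LogisticRegression,
assuming the featurized data already has label and features columns (the two
rows below are only placeholders, and spark is the session from the sketch
above):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Placeholder featurized data: one label plus one feature vector per user.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0))
)).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)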

I want to save this intermediate storage permanently since it will be used
many times. New events will also arrive every day, so I need to update the
intermediate storage daily.

Right now I store the intermediate data as JSON files. Should I use Parquet
instead? Or are there better solutions for this use case?
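
If I did switch to Parquet, my rough idea would be to partition the
intermediate table by ingestion date, so each daily update only appends a new
partition instead of rewriting everything (the output path and the ds column
name are again assumptions):

import org.apache.spark.sql.functions.current_date

// Append today's grouped events as a new date partition of the Parquet table.
eventsByUser
  .withColumn("ds", current_date())
  .write
  .mode("append")
  .partitionBy("ds")
  .parquet("hdfs:///events/by_user_parquet/")

// Later jobs read the table back; Parquet keeps the schema and only scans
// the columns they actually need.
val allUsersEvents = spark.read.parquet("hdfs:///events/by_user_parquet/")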

Thanks a lot !
