From: Ewan Leith
To: Gavin Yue
CC: user
Subject: RE: Should I convert json into parquet?
Date: Mon, 19 Oct 2015 09:31:24 +0000

As Jörn says, Parquet and ORC will get you really good compression and can be much faster. There are also some nice additions around predicate pushdown, which can be great if you've got wide tables.

Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is described here: http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/

Thanks,
Ewan

-----Original Message-----
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 19 October 2015 06:32
To: Gavin Yue
Cc: user
Subject: Re: Should I convert json into parquet?

Good formats are Parquet or ORC. Both can be used with compression, such as Snappy.
They are much faster than JSON. However, the table structure is up to you and depends on your use case.

> On 17 Oct 2015, at 23:07, Gavin Yue wrote:
>
> I have JSON files which contain timestamped events. Each event is associated with a user id.
>
> Now I want to group by user id, so I convert from
>
> Event1 -> UserIDA;
> Event2 -> UserIDA;
> Event3 -> UserIDB;
>
> to intermediate storage:
>
> UserIDA -> (Event1, Event2...)
> UserIDB -> (Event3...)
>
> Then I will label positives and featurize the Events Vector in many different ways, and fit each of them into Logistic Regression.
>
> I want to save the intermediate storage permanently, since it will be used many times. There will also be new events coming in every day, so I need to update this intermediate storage daily.
>
> Right now I store the intermediate data as JSON files. Should I use Parquet instead? Or is there a better solution for this use case?
>
> Thanks a lot!

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org