From: Ewan Leith
To: Gavin Yue
CC: user
Subject: RE: Should I convert json into parquet?
Date: Mon, 19 Oct 2015 09:31:24 +0000

As Jörn says, Parquet and ORC will get you really good compression and can be much faster. There are also some nice additions around predicate pushdown, which can be great if you've got wide tables.

Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is described here: http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/

Thanks,
Ewan

-----Original Message-----
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 19 October 2015 06:32
To: Gavin Yue
Cc: user
Subject: Re: Should I convert json into parquet?

Good formats are Parquet or ORC. Both can be used with compression, such as Snappy.
They are much faster than JSON. However, the table structure is up to you and depends on your use case.

> On 17 Oct 2015, at 23:07, Gavin Yue wrote:
>
> I have JSON files which contain timestamped events. Each event is associated with a user id.
>
> Now I want to group by user id, so I convert from
>
> Event1 -> UserIDA;
> Event2 -> UserIDA;
> Event3 -> UserIDB;
>
> to intermediate storage:
>
> UserIDA -> (Event1, Event2...)
> UserIDB -> (Event3...)
>
> Then I will label positives and featurize the Events Vector in many different ways, and fit each of them into Logistic Regression.
>
> I want to save the intermediate storage permanently, since it will be used many times. There will also be new events coming in every day, so I need to update this intermediate storage daily.
>
> Right now I store the intermediate data as JSON files. Should I use Parquet instead? Or is there a better solution for this use case?
>
> Thanks a lot!

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org