From: Cesar Flores
Date: Wed, 14 Oct 2015 10:45:11 -0500
Subject: Question about data frame partitioning in Spark 1.3.0
To: user@spark.apache.org

My current version of Spark is 1.3.0, and my question is the following:

I have large data frames whose main field is a user id. I need to do many group-bys and joins using that field. Will performance improve if, before doing any group by or join operation, I first convert each data frame to an RDD and partition it by the user id? In other words, will doing something like the following on all my user data tables improve performance in the long run?

import org.apache.spark.HashPartitioner

val partitioned_rdd = unpartitioned_df
  .map(row => (row.getLong(0), row))
  .partitionBy(new HashPartitioner(200))
  .map(x => x._2)

val partitioned_df = hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)

Thanks a lot
--
Cesar Flores
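[Editor's note: the intuition behind the pre-partitioning asked about above can be sketched without Spark. If two datasets are hash-partitioned by the same key into the same number of partitions, matching keys always land in the same partition, so a join can proceed partition-by-partition with no cross-partition traffic. The partitioner below only mimics the idea of Spark's `HashPartitioner` (hash of key modulo number of partitions); the `users`/`orders` data and all function names are made up for illustration. One Spark-specific caveat worth knowing: a plain `map` after `partitionBy`, as in the snippet above, discards the partitioner, so the pair form (or `mapPartitions(..., preservesPartitioning = true)`) is needed for the partitioning to survive to the join.]

```python
NUM_PARTITIONS = 4

def partition_by_key(pairs, num_partitions):
    """Group (key, value) pairs into buckets by key hash,
    mimicking hash partitioning (hash(key) % numPartitions)."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

def copartitioned_join(left_buckets, right_buckets):
    """Join two co-partitioned datasets one partition at a time:
    matching keys are guaranteed to be in the same bucket index."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        right_by_key = {}
        for key, value in right:
            right_by_key.setdefault(key, []).append(value)
        for key, lval in left:
            for rval in right_by_key.get(key, []):
                joined.append((key, lval, rval))
    return joined

# Hypothetical data keyed by user id.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (3, "ink")]

user_buckets = partition_by_key(users, NUM_PARTITIONS)
order_buckets = partition_by_key(orders, NUM_PARTITIONS)
result = sorted(copartitioned_join(user_buckets, order_buckets))
print(result)  # every match found without touching other partitions
```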