From: Cesar Flores
Date: Wed, 14 Oct 2015 10:45:11 -0500
Subject: Question about data frame partitioning in Spark 1.3.0
To: user@spark.apache.org

My current version of Spark is 1.3.0, and my question is the following:

I have large data frames whose main field is a user id. I need to do many group-bys and joins using that field. Will performance improve if, before doing any group by or join operation, I first convert each data frame to an RDD and partition it by the user id? In other words, will doing something like the following on all my user data tables improve performance in the long run?

import org.apache.spark.HashPartitioner

val partitioned_rdd = unpartitioned_df
  .map(row => (row.getLong(0), row))
  .partitionBy(new HashPartitioner(200))
  .map(x => x._2)

val partitioned_df = hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)

Thanks a lot
--
Cesar Flores
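[Editor's note: the intuition behind the pre-partitioning asked about above can be sketched without Spark. If two datasets are hash-partitioned by the same key into the same number of partitions, matching keys always land in the same partition, so a join can proceed partition-by-partition with no cross-partition traffic. The partitioner below only mimics the idea of Spark's `HashPartitioner` (hash of key modulo number of partitions); the `users`/`orders` data and all function names are made up for illustration. One Spark-specific caveat worth knowing: a plain `map` after `partitionBy`, as in the snippet above, discards the partitioner, so the pair form (or `mapPartitions(..., preservesPartitioning = true)`) is needed for the partitioning to survive to the join.]

```python
NUM_PARTITIONS = 4

def partition_by_key(pairs, num_partitions):
    """Group (key, value) pairs into buckets by key hash,
    mimicking hash partitioning (hash(key) % numPartitions)."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

def copartitioned_join(left_buckets, right_buckets):
    """Join two co-partitioned datasets one partition at a time:
    matching keys are guaranteed to be in the same bucket index."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        right_by_key = {}
        for key, value in right:
            right_by_key.setdefault(key, []).append(value)
        for key, lval in left:
            for rval in right_by_key.get(key, []):
                joined.append((key, lval, rval))
    return joined

# Hypothetical data keyed by user id.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (3, "ink")]

user_buckets = partition_by_key(users, NUM_PARTITIONS)
order_buckets = partition_by_key(orders, NUM_PARTITIONS)
result = sorted(copartitioned_join(user_buckets, order_buckets))
print(result)  # every match found without touching other partitions
```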