Subject: Re: Question about data frame partitioning in Spark 1.3.0
From: Michael Armbrust
Date: Wed, 14 Oct 2015 10:11:13 -0700
To: Cesar Flores
Cc: user@spark.apache.org

This won't help, for two reasons:

 1) It's all still just creating lineage, since you aren't caching the
    partitioned data. Spark will still fetch the shuffled blocks for each
    query.
 2) The query optimizer is not aware of RDD-level partitioning, since an
    RDD is mostly a black box to it.

1) could be fixed by adding caching. 2) is on our roadmap (though you'd
have to use logical DataFrame expressions to do the partitioning instead
of a class-based partitioner).

On Wed, Oct 14, 2015 at 8:45 AM, Cesar Flores wrote:

> My current version of Spark is 1.3.0, and my question is the following:
>
> I have large data frames whose main field is a user id. I need to do
> many group-bys and joins using that field. Will performance improve if,
> before doing any group-by or join operation, I first convert to an RDD
> and partition it by the user id? In other words, will trying something
> like the following lines on all my user data tables improve performance
> in the long run?
>
> val partitioned_rdd = unpartitioned_df
>   .map(row => (row.getLong(0), row))
>   .partitionBy(new HashPartitioner(200))
>   .map(x => x._2)
>
> val partitioned_df = hc.createDataFrame(partitioned_rdd,
>   unpartitioned_df.schema)
>
> Thanks a lot
> --
> Cesar Flores
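Point 1) above can be sketched in code. The following is a minimal, hedged sketch of the questioner's approach with caching added, assuming Spark 1.3-era APIs: `hc` is an existing `HiveContext` (or `SQLContext`) and `unpartitioned_df` is a DataFrame whose first column is a `Long` user id, as in the original question. It is an illustration of the advice, not a drop-in implementation.

```scala
import org.apache.spark.HashPartitioner

// Repartition the underlying RDD by user id (column 0), as in the question.
val partitioned_rdd = unpartitioned_df
  .map(row => (row.getLong(0), row))   // key each row by user id
  .partitionBy(new HashPartitioner(200))
  .map(_._2)                           // drop the key, keep the Row

val partitioned_df = hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)

// Without this, each subsequent query only extends the lineage and
// re-fetches the shuffled blocks. cache() materializes the partitioned
// data so later group-bys and joins reuse it instead of re-shuffling.
partitioned_df.cache()
```

Note that even with caching, the Spark 1.3 optimizer cannot see this RDD-level partitioning (point 2), so joins may still introduce their own shuffle.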