Date: Mon, 19 Oct 2015 10:44:49 +0530
Subject: Re: repartition vs partitionby
From: shahid ashraf
To: Adrian Tanase
Cc: Raghavendra Pandey, shahid qadri, User <user@spark.apache.org>

Yes, I am trying to do so, but that would repartition the whole data set.
Can't we split a single large (skewed) partition into multiple partitions?
Any ideas on this?

On Sun, Oct 18, 2015 at 1:55 AM, Adrian Tanase wrote:

> If the dataset allows it, you can try to write a custom partitioner to
> help Spark distribute the data more uniformly.
>
> Sent from my iPhone
>
> On 17 Oct 2015, at 16:14, shahid ashraf wrote:
>
> Yes, I know about that; it is for reducing the number of partitions. The
> point here is that the data is skewed towards a few partitions.
>
> On Sat, Oct 17, 2015 at 6:27 PM, Raghavendra Pandey <
> raghavendra.pandey@gmail.com> wrote:
>
>> You can use the coalesce function if you want to reduce the number of
>> partitions.
>> This one minimizes the data shuffle.
>>
>> -Raghav
>>
>> On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri wrote:
>>
>>> Hi folks,
>>>
>>> I need to repartition a large data set (around 300 GB), as I see some
>>> partitions hold far more data than others (data skew).
>>>
>>> I have pair RDDs: [({},{}),({},{}),({},{})]
>>>
>>> What is the best way to solve this problem?
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>
>
> --
> with Regards
> Shahid Ashraf

--
with Regards
Shahid Ashraf
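The split asked about at the top of the thread cannot be done by a partitioner alone, because a (hash) partitioner must send all records with a given key to the same partition. The usual workaround is "salting": rewrite each hot key as (key, random_bucket) before the shuffle so one skewed key spreads over several partitions. A minimal pure-Python sketch of the idea (the key names and bucket count are illustrative, not from the thread):

```python
import random

# Keys known to be skewed, and how many buckets to spread each one over
# (both values are made-up examples for illustration).
HOT_KEYS = {"popular"}
SALT_BUCKETS = 4

def salt_key(key):
    """Rewrite a hot key as (key, bucket) so its records hash to
    SALT_BUCKETS different partitions instead of one."""
    if key in HOT_KEYS:
        return (key, random.randrange(SALT_BUCKETS))
    return (key, 0)  # non-hot keys get a fixed bucket

def unsalt_key(salted):
    """Recover the original key after the per-bucket aggregation step."""
    return salted[0]

# A tiny skewed data set: one hot key, one rare key.
records = [("popular", i) for i in range(8)] + [("rare", 0)]
salted = [(salt_key(k), v) for k, v in records]
```

After aggregating on the salted keys, a second, much cheaper aggregation on `unsalt_key` merges the per-bucket partial results back to one value per original key.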
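The custom-partitioner suggestion in the thread comes down to overriding how keys map to partition indices. The sketch below shows one possible mapping policy in plain Python: reserve a few partitions for known hot keys so they do not pile onto the same partition as everything else. The policy, the key names, and the function name are all illustrative assumptions, not from the thread or the Spark API.

```python
def skew_aware_partition(key, num_partitions, reserved_for_hot=2,
                         hot_keys=frozenset({"popular"})):
    """Map a key to a partition index in [0, num_partitions).

    Hot keys are confined to the first `reserved_for_hot` partitions;
    all other keys hash into the remaining range, so ordinary keys never
    share a partition with a hot key. (Illustrative policy only.)
    """
    if key in hot_keys:
        return hash(key) % reserved_for_hot
    return reserved_for_hot + hash(key) % (num_partitions - reserved_for_hot)

# Example: place a mix of hot and ordinary keys into 8 partitions.
num_parts = 8
indices = {k: skew_aware_partition(k, num_parts)
           for k in ["popular", "alpha", "beta", "gamma"]}
```

In PySpark such a function could be passed as the partition function of `RDD.partitionBy`; in Scala the equivalent is subclassing `Partitioner`. Note this isolates hot keys but still cannot split one key across partitions; for that, the salting approach above is needed.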