spark-user mailing list archives

From: Daniel Haviv <daniel.ha...@veracity-group.com>
Subject: Re: Local Repartition
Date: Mon, 20 Jul 2015 15:46:43 GMT
Great explanation.

Thanks guys!

Daniel

> On 20 July 2015, at 18:12, Silvio Fiorito <silvio.fiorito@granturing.com> wrote:
> 
> Hi Daniel,
> 
> Coalesce, by default, will not cause a shuffle. The second parameter, when set to true, will cause a full shuffle. This is actually what repartition does (it calls coalesce with shuffle = true).
> 
> It will attempt to keep colocated partitions together (as you describe) on the same executor. What may happen is that you lose data locality if you reduce the partitions to fewer than the number of executors. You obviously also reduce parallelism, so you need to be aware of that as you decide when to call coalesce.
> 
> Thanks,
> Silvio
> 
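A minimal sketch of the distinction Silvio describes, assuming a standalone Scala application (the object name, app name, and partition counts are illustrative, not from the thread):

import org.apache.spark.{SparkConf, SparkContext}

object CoalesceVsRepartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-vs-repartition"))

    // Stand-in for an RDD built from many small files: 400 small partitions.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 400)

    // coalesce(n) defaults to shuffle = false: it merges existing partitions,
    // keeping them on the executors where they already live, so no full shuffle.
    val merged = rdd.coalesce(40)

    // repartition(n) is coalesce(n, shuffle = true): every record is
    // redistributed across the cluster through a full shuffle.
    val reshuffled = rdd.repartition(40)

    println(s"original:    ${rdd.partitions.length} partitions")
    println(s"coalesced:   ${merged.partitions.length} partitions (no shuffle)")
    println(s"repartition: ${reshuffled.partitions.length} partitions (full shuffle)")

    sc.stop()
  }
}

Both calls end up with 40 partitions; the difference is that repartition schedules an extra shuffle stage, which shows up in the Spark UI.
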
> From: Daniel Haviv
> Date: Monday, July 20, 2015 at 4:59 PM
> To: Doug Balog
> Cc: user
> Subject: Re: Local Repartition
> 
> Thanks, Doug,
> coalesce might invoke a shuffle as well.
> I don't think what I'm suggesting is a feature, but it definitely should be.
> 
> Daniel
> 
>> On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog <doug@balog.net> wrote:
>> Hi Daniel,
>> Take a look at .coalesce()
>> I’ve seen good results by coalescing to num executors * 10, but I’m still trying to figure out the
>> optimal number of partitions per executor.
>> To get the number of executors, use sc.getConf.getInt("spark.executor.instances", -1).
>> 
>> 
>> Cheers,
>> 
>> Doug
>> 
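A hedged sketch of Doug's heuristic, assuming a deployment where spark.executor.instances is set explicitly (e.g. on YARN); the input path and object name are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object CompactSmallFilePartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compact-partitions"))

    // Reading a directory of many small files typically yields roughly one
    // partition per file, hence the very high partition count.
    val lines = sc.textFile("/data/small-files/*") // illustrative path

    // Doug's heuristic: aim for (number of executors) * 10 partitions.
    // spark.executor.instances is only populated when executor instances are
    // requested explicitly; -1 here means "unknown".
    val numExecutors = sc.getConf.getInt("spark.executor.instances", -1)
    val target =
      if (numExecutors > 0) numExecutors * 10
      else lines.partitions.length // unknown executor count: leave partitions as-is

    // shuffle defaults to false, so partitions are merged in place where possible.
    val compacted = lines.coalesce(target)
    println(s"${lines.partitions.length} -> ${compacted.partitions.length} partitions")

    sc.stop()
  }
}

If the target ever drops below the number of executors, the data-locality caveat Silvio mentions above applies.
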
>> > On Jul 20, 2015, at 5:04 AM, Daniel Haviv <daniel.haviv@veracity-group.com> wrote:
>> >
>> > Hi,
>> > My data is constructed from a lot of small files, which results in a lot of partitions per RDD.
>> > Is there some way to locally repartition the RDD without shuffling, so that all of the partitions that reside on a specific node will become X partitions on the same node?
>> >
>> > Thank you.
>> > Daniel
> 
