spark-dev mailing list archives

From 汪洋 <tiandiwo...@icloud.com>
Subject Re: rdd.distinct with Partitioner
Date Thu, 09 Jun 2016 05:18:47 GMT
Frankly speaking, I think reduceByKey with a Partitioner has the same problem and should not be exposed to public users either, because it is hard to fully understand how the partitioner behaves without looking at the actual code.

And if there exists a basic contract for a Partitioner, maybe it should be stated explicitly in the documentation if it is not enforced by code.
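
The contract, as I understand it, is that a Partitioner must assign equal keys to the same partition. Here is a minimal sketch of how violating that contract breaks reduceByKey (RandomPartitioner is an illustrative class I made up, not a Spark API):

import scala.util.Random
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner that ignores the key entirely, so two equal
// keys may land in different partitions -- a violation of the contract.
class RandomPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = Random.nextInt(parts)
}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("demo"))
val result = sc.parallelize(Seq("a" -> 1, "a" -> 1, "b" -> 1), numSlices = 3)
  .reduceByKey(new RandomPartitioner(2), _ + _)
  .collect()
// The two "a" records can be shuffled to different partitions and reduced
// separately, so the output may contain ("a", 1) twice instead of ("a", 2).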

However, I don’t feel strongly enough about this issue to argue beyond stating my concern. It will not cause too much trouble once users learn the semantics; it is just a judgment call for the API designer.
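
For contrast, a partitioner that honors the contract, such as the built-in HashPartitioner, makes the same call behave as expected (reusing sc from the sketch above):

import org.apache.spark.HashPartitioner

// HashPartitioner routes equal keys to the same partition, so each key
// is reduced exactly once.
val ok = sc.parallelize(Seq("a" -> 1, "a" -> 1, "b" -> 1), numSlices = 3)
  .reduceByKey(new HashPartitioner(2), _ + _)
  .collect()  // Array(("a", 2), ("b", 1)), in some order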


> On Jun 9, 2016, at 12:51 PM, Alexander Pivovarov <apivovarov@gmail.com> wrote:
> 
> reduceByKey(randomPartitioner, (a, b) => a + b) also gives an incorrect result.
> 
> Why does reduceByKey with a Partitioner exist, then?
> 
> On Wed, Jun 8, 2016 at 9:22 PM, 汪洋 <tiandiwoxin@icloud.com> wrote:
> Hi Alexander,
> 
> I think it is not guaranteed to be correct if an arbitrary Partitioner is passed in.
> 
> I have created a notebook and you can check it out: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366/latest.html
> 
> Best regards,
> 
> Yang
> 
> 
>> On Jun 9, 2016, at 11:42 AM, Alexander Pivovarov <apivovarov@gmail.com> wrote:
>> 
>> Most of the RDD methods that shuffle data take a Partitioner as a parameter.
>> 
>> But rdd.distinct does not have such a signature.
>> 
>> Should I open a PR for that?
>> 
>> /**
>>  * Return a new RDD containing the distinct elements in this RDD.
>>  */
>> def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
>>   map(x => (x, null)).reduceByKey(partitioner, (x, y) => x).map(_._1)
>> }
> 
> 

