spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: PySpark RDD.partitionBy() requires an RDD of tuples
Date Wed, 02 Apr 2014 22:26:38 GMT
Will be in 1.0.0


On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:

> Ah, now I see what Aaron was referring to. So I'm guessing we will get
> this in the next release or two. Thank you.
>
>
>
> On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
>
>> There is a repartition method in pyspark master:
>> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128
>>
>>
>> On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>> Update: I'm now using this ghetto function to partition the RDD I get
>>> back when I call textFile() on a gzipped file:
>>>
>>> # Python 2.6
>>> def partitionRDD(rdd, numPartitions):
>>>     counter = {'a': 0}
>>>     def count_up(x):
>>>         counter['a'] += 1
>>>         return counter['a']
>>>     return (rdd.keyBy(count_up)
>>>         .partitionBy(numPartitions)
>>>         .map(lambda (key, data): data))  # drop the counter key again
>>>
>>> If there's supposed to be a built-in Spark method to do this, I'd love
>>> to learn more about it.
>>>
>>> Nick
>>>
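[Editor's note: the counter-keying trick above can be modeled in plain Python. This is a hedged sketch, not PySpark internals: PySpark hashes keys with its own portable hash, whereas this model uses the built-in hash() for illustration.]

```python
from itertools import count

def partition_elements(elements, num_partitions):
    """Model of the keyBy(count_up).partitionBy(n).map(drop-key) chain:
    tag each element with an increasing integer key, route the element
    to hash(key) % num_partitions, then discard the key."""
    counter = count(1)  # stands in for the mutable-dict counter in the email
    partitions = [[] for _ in range(num_partitions)]
    for element in elements:
        key = next(counter)
        partitions[hash(key) % num_partitions].append(element)
    return partitions

parts = partition_elements(list("abcdefghij"), 5)
```

Because the keys are consecutive integers, the elements spread evenly across the five buckets, which is exactly the effect the workaround is after.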
>>>
>>> On Tue, Apr 1, 2014 at 7:59 PM, Nicholas Chammas <
>>> nicholas.chammas@gmail.com> wrote:
>>>
>>>> Hmm, doing help(rdd) in PySpark doesn't show a method called
>>>> repartition(). Trying rdd.repartition() or rdd.repartition(10) also
>>>> fails. I'm on 0.9.0.
>>>>
>>>> The approach I'm going with to partition my MappedRDD is to key it by a
>>>> random int, and then partition it.
>>>>
>>>> So something like:
>>>>
>>>> rdd = sc.textFile('s3n://gzipped_file_brah.gz') # rdd has 1 partition;
>>>> minSplits is not actionable due to gzip
>>>> keyed_rdd = rdd.keyBy(lambda x: randint(1,100)) # we key the RDD so we
>>>> can partition it
>>>> partitioned_rdd = keyed_rdd.partitionBy(10)     # rdd has 10 partitions
>>>>
>>>> Are you saying I don't have to do this?
>>>>
>>>> Nick
>>>>
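[Editor's note: the random-key approach above can likewise be modeled in plain Python. A hedged sketch with a seeded RNG for reproducibility; the key range (1–100) and partition count (10) are taken from the email, the hash routing is an illustration rather than PySpark's actual partitioner.]

```python
import random

def random_key_partition(elements, num_partitions, seed=42):
    """Model of rdd.keyBy(lambda x: randint(1, 100)).partitionBy(n):
    assign each element a random key, then send it to
    hash(key) % num_partitions."""
    rng = random.Random(seed)  # seeded so the sketch is deterministic
    partitions = [[] for _ in range(num_partitions)]
    for element in elements:
        key = rng.randint(1, 100)
        partitions[hash(key) % num_partitions].append(element)
    return partitions

parts = random_key_partition(range(1000), 10)
```

With enough elements the random keys spread roughly evenly over the ten partitions, which is why this works as a stand-in for repartition() on a single-partition gzip RDD.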
>>>>
>>>>
>>>> On Tue, Apr 1, 2014 at 7:38 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
>>>>
>>>>> Hm, yeah, the docs are not clear on this one. The function you're
>>>>> looking for to change the number of partitions on any ol' RDD is
>>>>> "repartition()", which is available in master but for some reason doesn't
>>>>> seem to show up in the latest docs. Sorry about that, I also didn't realize
>>>>> partitionBy() had this behavior from reading the Python docs (though it is
>>>>> consistent with the Scala API, just more type-safe there).
>>>>>
>>>>>
>>>>> On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas <
>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>
>>>>>> Just an FYI, it's not obvious from the docs<http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>
>>>>>> that the following code should fail:
>>>>>>
>>>>>> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>>>>>> a._jrdd.splits().size()
>>>>>> a.count()
>>>>>> b = a.partitionBy(5)
>>>>>> b._jrdd.splits().size()
>>>>>> b.count()
>>>>>>
>>>>>> I figured out from the example that if I generated a key by doing this
>>>>>>
>>>>>> b = a.map(lambda x: (x, x)).partitionBy(5)
>>>>>>
>>>>>> then all would be well.
>>>>>>
>>>>>> In other words, partitionBy() only works on RDDs of tuples. Is that
>>>>>> correct?
>>>>>>
>>>>>> Nick
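[Editor's note: yes, that is correct, and a plain-Python model shows why. partitionBy() routes each element by its key, so it has to unpack every element as a (key, value) pair; a bare int has no key to extract. The function below is an illustration of that behavior, not PySpark's real implementation, and uses the built-in hash() as a stand-in for PySpark's portable hash.]

```python
def partition_by(pairs, num_partitions):
    """Model of partitionBy(): route each (key, value) pair to
    hash(key) % num_partitions. Non-pair elements fail at the
    unpacking step, mirroring the error on an RDD of non-tuples."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:  # raises TypeError for a bare int
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Mirrors b = a.map(lambda x: (x, x)).partitionBy(5)
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
parts = partition_by([(x, x) for x in data], 5)
```

Mapping each element to (x, x) first gives every element a key, which is exactly the fix described in the email.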
>>>>>>
>>>>>>
>>>>>> ------------------------------
>>>>>> View this message in context: PySpark RDD.partitionBy() requires
an
>>>>>> RDD of tuples<http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-RDD-partitionBy-requires-an-RDD-of-tuples-tp3598.html>
>>>>>> Sent from the Apache Spark User List mailing list archive<http://apache-spark-user-list.1001560.n3.nabble.com/>at
Nabble.com.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
