spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From zhaoxw12 <zhaox...@mails.tsinghua.edu.cn>
Subject How to get well-distribute partition
Date Tue, 25 Feb 2014 02:16:58 GMT
I use spark-0.8.0. This is my code in python.


list = [(i, i*i) for i in xrange(0, 16)]*10
rdd = sc.parallelize(list, 80)
temp = rdd.collect()
temp2 = rdd.partitionBy(16, lambda x: x )
count = 0 
for i in temp2.glom().collect():
  print count, "**", i
  count += 1

This will get result below:

0 ** [(10, 100), (1, 1), (10, 100)... (10, 100), (1, 1), (10, 100)]
1 ** [(2, 4), (11, 121), (11, 121)... (2, 4), (11, 121), (2, 4), (11, 121)]
2 ** [(12, 144), (12, 144), (3, 9)... (12, 144), (12, 144), (3, 9), (12,
144)]
3 ** [(13, 169), (13, 169), (4, 16)... (4, 16), (13, 169), (13, 169), (13,
169)]
4 ** [(14, 196), (5, 25), (5, 25)... (14, 196), (5, 25), (14, 196), (5, 25),
(14, 196)]
5 ** [(6, 36), (6, 36), (15, 225)...(6, 36), (6, 36), (15, 225), (15, 225),
(6, 36)]
6 ** [(7, 49), (7, 49), (7, 49)... (7, 49), (7, 49), (7, 49), (7, 49), (7,
49), (7, 49)]
7 ** [(8, 64), (8, 64), (8, 64)... (8, 64), (8, 64), (8, 64), (8, 64), (8,
64), (8, 64)]
8 ** [(9, 81), (9, 81), (9, 81)... (9, 81), (9, 81), (9, 81), (9, 81), (9,
81), (9, 81)]
9 ** []
10 ** []
11 ** []
12 ** []
13 ** []
14 ** []
15 ** [(0, 0), (0, 0), (0, 0), (0, 0)... (0, 0), (0, 0), (0, 0), (0, 0), (0,
0)]

I want that each partition only has one key number. As you see, there are 6
partitions which are empty and 6 partition which have 2 key numbers. It will
cause me a lot of trouble when I was handling big datas.
I face the same problem as
"http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html". 
But I have not found a solution in python. 
Thank a lot for your help. :)



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-well-distribute-partition-tp2002.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Mime
View raw message