spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Guillaume Pitel <guillaume.pi...@exensa.com>
Subject Re: How to efficiently join this two complicated rdds
Date Tue, 18 Feb 2014 10:52:05 GMT
Here's what I would do :

RDD1 :
"1" "2" "3"
"1" "3" "5"

RDD2 :
("1", 11)
("2", 22)
("3", 33)
("5", 55)

1 / flatMap your lines from RDD1 to RDD1bis (key,lineId) (you'll have to use a 
mapPartitionWithIndex for that in order to produce a lineId)
So you go from this :

"1" "2" "3"
"1" "3" "5"

to :

("1","L1")
("2","L1")
("3","L1")
("1","L2")
("3","L2")
("5","L2")

2 / join RDD1 and RDD2 => RDD1+2

("1",("L1",11))
("2",("L1",22))
("3",("L1",33))
("1",("L2",11))
("3",("L2",33))
("5",("L2",55))

3/ map RDD1+2 to (_2)
4/ groupBy _1 (lineIds) and sum
("L1",[11,22,33])
("L2",[11,33,55])

Guillaume

> Sincerely thank you for your work about spark. It simplifies the parallel
> and iteration process of program. It means a lot for us.
>
>
> Our program on spark face a small problem and we seek for your help to find
> an efficient way to solve this problem.
>
>   
>
> Environments:
>
> We run spark in standalone mode. We have two RDDs:
>
> RDD detail:
>
> Type one RDD: generated from sc.textFile (‘file’),each line in ‘file’ is a
> list of keys, like the following lines:
>
>
> 1 149 255 2238 4480 5951 7276 7368 14670 12661 13060 13450 14674
>
> 1 149 255 2238 4480 5951 7276 7368 7678 12672 13078 13450 14674
>
> 1 149 257 2239 4485 5952 7276 7368 7678 12683 13096 13450 14674
>
> 1 149 259 2241 4487 5954 7276 7368 7678 12683 13096 14673 14674
>
> 1 149 260 2242 4488 5955 7276 7368 14670 14671 14672 14673 14674
>
> 1 151 258 2240 4486 5953 7276 7368 14670 12684 13096 13450 14674
>
> 1 151 258 2240 4486 5953 7276 7368 14670 14671 14672 13450 14674
>
> 1 151 259 2241 4487 5954 7276 7368 7678 12683 13096 13450 14674
>
> 1 153 250 2237 4472 5950 7276 7368 14670 14671 13078 14673 14674
>
> 1 153 258 2240 4486 5953 7276 7368 7678 12683 13096 14673 14674
>
> ...
>
> Type two RDD: a set of (key, value).
>
>   
>
> The problem we want to solve:
>
> For each line in RDD one, we need to use the keys of the line to search the
> value according to the key in RDD of type two. And finally get the sum of
> these values.
>
>   
>
> Other Details:
>
> The number of the keys in one line of type one RDD is about 50. The size of
> RDD one file is about 10GB.
>
> The biggest number of key in RDD two is about 4500000, and we will not
> storage the (key, value) if value is zero.
>
> And maybe the type one RDD has a lot key numbers of 1 but a few of 15877.
>
>   
>
> We want to fine a fast way to solve this problem.
>
> Sincerely thanks
>
>
>      Bo Han
> .
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficiently-join-this-two-complicated-rdds-tp1665.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


-- 
eXenSa

	
*Guillaume PITEL, Président*
+33(0)6 25 48 86 80

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05


Mime
View raw message