spark-user mailing list archives

From Aurélien Bellet <aurelien.bel...@telecom-paristech.fr>
Subject Re: Random pairs / RDD order
Date Sun, 19 Apr 2015 07:04:32 GMT
Hi Imran,

Thanks for the suggestion! Unfortunately the types do not match:
mapPartitions expects a function from Iterator to Iterator, and
scala.util.Random.shuffle does not fit that signature. But I could write
my own function that shuffles the sample.
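
For example, something along these lines (an untested sketch: it assumes
each partition's sample fits in memory, since the shuffle has to
materialize the iterator into a collection first):

val sample1 = rdd.sample(true, 0.01, 42).mapPartitions { it =>
  // buffer this partition's sample, shuffle it, return a fresh iterator
  scala.util.Random.shuffle(it.toSeq).iterator
}
val sample2 = rdd.sample(true, 0.01, 43).mapPartitions { it =>
  scala.util.Random.shuffle(it.toSeq).iterator
}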

On 4/17/15 9:34 PM, Imran Rashid wrote:
> if you can store the entire sample for one partition in memory, I think
> you just want:
>
> val sample1 = rdd.sample(true, 0.01, 42).mapPartitions(scala.util.Random.shuffle)
> val sample2 = rdd.sample(true, 0.01, 43).mapPartitions(scala.util.Random.shuffle)
>
> ...
>
>
>
> On Fri, Apr 17, 2015 at 3:05 AM, Aurélien Bellet
> <aurelien.bellet@telecom-paristech.fr> wrote:
>
>     Hi Sean,
>
>     Thanks a lot for your reply. The problem is that I need to sample
>     random *independent* pairs. If I draw two samples and build all
>     n*(n-1) pairs, then there is a lot of dependency. My current solution
>     is also not satisfactory, because some pairs (the closest ones in a
>     partition) have a much higher probability of being sampled. I am not
>     sure how to fix this.
>
>     Aurelien
>
>
>     On 16/04/2015 20:44, Sean Owen wrote:
>
>         Use mapPartitions, and then take two random samples of the
>         elements in the partition, and return an iterator over all
>         pairs of them? Should be pretty simple assuming your sample
>         size n is smallish, since you're returning ~n^2 pairs.
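>
>         For example (a rough, untested sketch: it assumes each partition
>         fits in memory, and n below is a hypothetical per-partition
>         sample size):
>
>         val n = 100 // hypothetical per-partition sample size
>         val pairs = rdd.mapPartitions { it =>
>           val buf = it.toIndexedSeq
>           if (buf.isEmpty) Iterator.empty
>           else {
>             val rnd = new scala.util.Random()
>             // two independent with-replacement samples from this partition
>             val s1 = IndexedSeq.fill(n)(buf(rnd.nextInt(buf.size)))
>             val s2 = IndexedSeq.fill(n)(buf(rnd.nextInt(buf.size)))
>             // emit all ~n^2 pairs of the two samples
>             for (a <- s1.iterator; b <- s2.iterator) yield (a, b)
>           }
>         }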
>
>         On Thu, Apr 16, 2015 at 7:00 PM, abellet
>         <aurelien.bellet@telecom-paristech.fr> wrote:
>
>             Hi everyone,
>
>             I have a large RDD and I am trying to create an RDD of
>             random pairs of elements sampled from it. For efficiency,
>             the two elements of a pair should come from the same
>             partition. The idea I've come up with is to take two
>             random samples and then use zipPartitions to pair the
>             i-th element of the first sample with the i-th element
>             of the second sample. Here is some sample code
>             illustrating the idea:
>
>             -----------
>             val rdd = sc.parallelize(1 to 60000, 16)
>
>             val sample1 = rdd.sample(true, 0.01, 42)
>             val sample2 = rdd.sample(true, 0.01, 43)
>
>             def myfunc(s1: Iterator[Int], s2: Iterator[Int]): Iterator[String] = {
>               var res = List[String]()
>               while (s1.hasNext && s2.hasNext) {
>                 res ::= s1.next + " " + s2.next
>               }
>               res.iterator
>             }
>
>             val pairs = sample1.zipPartitions(sample2)(myfunc)
>             -------------
>
>             However, I am not happy with this solution because each
>             element is most likely to be paired with elements that are
>             close by in the partition. This is because sample returns
>             an "ordered" Iterator.
>
>             Any idea how to fix this? So far I have not found a way
>             to shuffle the random sample efficiently.
>
>             Thanks a lot!
>
>
>
>             --
>             View this message in context:
>             http://apache-spark-user-list.1001560.n3.nabble.com/Random-pairs-RDD-order-tp22529.html
