spark-user mailing list archives

From podioss
Subject KMeans takeSample jobs and RDD cached
Date Sat, 25 Apr 2015 13:36:44 GMT
I am running the k-means algorithm with the initialization mode set to
random, across various dataset sizes and numbers of clusters, and I have a
question about the takeSample job of the algorithm.
More specifically, I notice that every application has two sampling jobs.
The first takes longer than any other job, while the second is much
quicker, which made me want to investigate what is actually happening.
To understand it, I checked the source code of the takeSample operation and
saw that it involves a count action followed by the computation of a
PartitionwiseSampledRDD with a PoissonSampler.
So my question is whether that count action corresponds to the first
takeSample job, and whether the second takeSample job is the one doing the
actual sampling.
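
To make the structure described above concrete, here is a minimal
plain-Python sketch of a "count, then sample" operation. It is a toy model
and not Spark's implementation: the function name take_sample_sketch, the
jobs list, and the 1.5 oversampling factor are assumptions made for
illustration only. Partitions are modeled as plain lists, and each full
pass over the data is recorded so the two kinds of pass can be told apart.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def take_sample_sketch(partitions, num, seed=0):
    """Toy model: one counting pass, then one or more sampling passes."""
    rng = random.Random(seed)
    jobs = []  # hypothetical log of full passes over the data

    # Pass 1: count every element. This touches all partitions in full.
    total = sum(len(p) for p in partitions)
    jobs.append("count")
    if total == 0 or num <= 0:
        return [], jobs

    # Oversample slightly so a single pass usually yields enough elements.
    # The 1.5 factor is an arbitrary assumption, not a value taken from Spark.
    fraction = 1.5 * num / total

    sample = []
    while len(sample) < num:
        # Pass 2: sample each partition independently, with replacement.
        # Each element is emitted Poisson(fraction) times.
        sample = [x for p in partitions for x in p
                  for _ in range(poisson(rng, fraction))]
        jobs.append("sample")

    rng.shuffle(sample)
    return sample[:num], jobs


if __name__ == "__main__":
    parts = [list(range(0, 50)), list(range(50, 100))]
    picked, passes = take_sample_sketch(parts, 10, seed=1)
    print(passes, picked)
```

In this model the counting pass always comes first and reads every element,
while the sampling pass keeps only a small fraction of them. Whether the two
jobs shown in the web UI map onto these two passes in the same way is
exactly the open question above.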

I also have a question about the RDDs that are created for k-means. Midway
through execution, the Storage tab of the web UI shows 3 RDDs with their
partitions cached in memory across all nodes, which is very helpful for
monitoring. The problem is that after the application completes I can see
only one of them, along with the portion of cache memory it used. Why
doesn't the web UI display all the RDDs involved in the computation?

Thank you
