Date: Sat, 25 Apr 2015 06:36:44 -0700 (MST)
From: podioss
To: user@spark.apache.org
Message-ID: <1429969004515-22656.post@n3.nabble.com>
Subject: KMeans takeSample jobs and RDD cached

Hi, I am running the k-means
algorithm with the initialization mode set to random, on various dataset sizes and numbers of clusters, and I have a question about the takeSample job of the algorithm. More specifically, I notice that every application has two sampling jobs. The first one takes the most time of all the jobs, while the second one is much quicker, and that made me want to investigate what is actually happening.

To explain it, I checked the source code of the takeSample operation and saw that it involves a count action and then the computation of a PartitionwiseSampledRDD with a PoissonSampler. So my question is whether that count action corresponds to the first takeSample job, and whether the second takeSample job is the one doing the actual sampling.

I also have a question about the RDDs that are created for k-means. In the middle of the execution, under the Storage tab of the web UI, I can see 3 RDDs with their partitions cached in memory across all nodes, which is very helpful for monitoring. The problem is that after completion I can see only one of them and the portion of cache memory it used. Why doesn't the web UI display all the RDDs involved in the computation?

Thank you

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
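
The pattern described in the question (a count, followed by a per-partition Poisson sampling pass) can be sketched in plain Python with no Spark dependency. This is only an illustrative model, not Spark's code: the names `take_sample` and `poisson`, the `jobs` log, and the 1.5 oversampling factor are all assumptions made for the example. If takeSample behaves as described, the first full pass over the data would be the count and any later pass would be the sampling. Whether that maps onto the two jobs shown in the UI is exactly what is being asked.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def take_sample(partitions, num, seed=0):
    """Hypothetical two-step sample with replacement over a list of partitions.

    Returns the sampled items plus a log of the full passes ("jobs") it made.
    """
    rng = random.Random(seed)
    jobs = []

    # Pass 1: a full scan just to learn the dataset size.
    total = sum(len(part) for part in partitions)
    jobs.append("count")
    if num == 0 or total == 0:
        return [], jobs

    # Assumed oversampling factor so that one sampling pass usually suffices.
    fraction = 1.5 * num / total

    # Pass 2 (repeated only if too few items came back): each item is
    # emitted Poisson(fraction) times, independently in every partition.
    while True:
        sample = []
        for part in partitions:
            for item in part:
                sample.extend([item] * poisson(rng, fraction))
        jobs.append("sample")
        if len(sample) >= num:
            break

    # Trim the oversampled result down to exactly num items.
    rng.shuffle(sample)
    return sample[:num], jobs
```

Calling `take_sample([list(range(1000)), list(range(1000, 2000))], 10)` returns 10 items and a `jobs` list that starts with `"count"` followed by one or more `"sample"` entries. In this model the sample size has to be known before the sampling fraction can be chosen, which is why the count comes first.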