From: martinjaggi
To: dev@spark.incubator.apache.org
Reply-To: dev@spark.incubator.apache.org
Subject: [GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
Message-Id: <20140218233009.D5FD68C21F9@tyr.zones.apache.org>
Date: Tue, 18 Feb 2014 23:30:09 +0000 (UTC)

Github user martinjaggi commented on the pull request:

    https://github.com/apache/incubator-spark/pull/575#issuecomment-35448685

Thanks @mengxr for the benchmark efforts!

I'm just not sure whether you saw my comment about part 2) of the benchmark, k-means. In my opinion this algorithm is not very suitable for judging the sparse vector overhead, since it's currently the only method in MLlib that does *not* communicate the vectors (only the dense centers). In contrast, all gradient-based methods need to communicate the sparse vectors in each iteration (of an MR). For these, serialization can often take about as long as the vector x vector product, which is all the computation there is. So both matter in practice, but currently we only benchmark one of the two, right?

Maybe this has something to do with what @etrain ran into with the early sparse tests? Or do you guys think this is not an issue?

I would be curious to see how the candidates perform on some of the gradient-based methods, and at which sparsity/load factor the sparse vectors start beating the dense vectors.
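
To make the sparsity/load-factor question above concrete, here is a rough back-of-the-envelope sketch in Python. It is not MLlib code and not one of the benchmarked candidates. It assumes an index/value sparse layout with 8-byte values and 4-byte indices, and it ignores serializer framing and object headers, so the numbers only indicate where the payload sizes cross over. All function names are hypothetical.

```python
def dense_bytes(n, value_bytes=8):
    """Payload size of a dense vector with n entries (assumed 8-byte doubles)."""
    return n * value_bytes


def sparse_bytes(nnz, value_bytes=8, index_bytes=4):
    """Payload size of an index/value sparse vector with nnz nonzeros.

    Assumes one 4-byte index plus one 8-byte value per nonzero.
    """
    return nnz * (value_bytes + index_bytes)


def break_even_density(value_bytes=8, index_bytes=4):
    """Fraction of nonzeros at which sparse and dense payloads are equal."""
    return value_bytes / (value_bytes + index_bytes)


def sparse_dot(indices, values, dense):
    """Sparse x dense dot product: touches only the nonzero entries."""
    return sum(v * dense[i] for i, v in zip(indices, values))


if __name__ == "__main__":
    n = 1000
    for nnz in (10, 100, 500, 667, 900):
        print(nnz / n, dense_bytes(n), sparse_bytes(nnz))
    print("break-even density:", break_even_density())
```

Under these assumptions the sparse payload is smaller whenever fewer than two thirds of the entries are nonzero. The real crossover for communication cost depends on the serializer and would have to be measured, which is the point of benchmarking a gradient-based method.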