spark-dev mailing list archives

From martinjaggi <...@git.apache.org>
Subject [GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
Date Tue, 18 Feb 2014 23:30:09 GMT
Github user martinjaggi commented on the pull request:

    https://github.com/apache/incubator-spark/pull/575#issuecomment-35448685
  
    Thanks @mengxr for the benchmark efforts! I'm not sure whether you saw my comment about part
2) of the benchmark, k-means: in my opinion this algorithm is not very suitable for judging
the sparse vector overhead, since it's the only method in MLlib currently that does *not*
communicate the vectors (only the dense centers). In contrast, all gradient-based methods
need to communicate the sparse vectors in each iteration (of a MR). For these, serialization
can often take about as long as computing the vector x vector product, which is all the computation.
So both costs matter in practice, but currently we only benchmark one of
the two, right?
    
    Maybe this has something to do with what @etrain ran into with the early
sparse tests? Or do you guys think this is not an issue? I would be curious to see how the
candidates perform on some of the gradient-based methods, and at which sparsity/load factor the
sparse vectors start beating the dense vectors.
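
    To make the serialization side of that question concrete, here is a rough sketch in plain Python. It is not MLlib code and not any of the candidate libraries: the function names are hypothetical, and it assumes a simple index/value layout with the platform's C `int` indices (typically 4 bytes) and 8-byte double values, ignoring any framing overhead a real serializer would add. It compares the raw byte size of the dense and sparse encodings of the same vector and computes the sparse x dense dot product that a gradient step would need.

```python
import random
from array import array


def to_sparse(dense):
    """Return (indices, values) arrays holding the non-zero entries of `dense`."""
    idx = array("i")  # C int indices, typically 4 bytes each
    val = array("d")  # C double values, 8 bytes each
    for i, x in enumerate(dense):
        if x != 0.0:
            idx.append(i)
            val.append(x)
    return idx, val


def dense_bytes(dense):
    """Raw payload size of the dense encoding (one double per entry)."""
    return len(array("d", dense).tobytes())


def sparse_bytes(idx, val):
    """Raw payload size of the sparse encoding (one index + one double per non-zero)."""
    return len(idx.tobytes()) + len(val.tobytes())


def sparse_dot(idx, val, weights):
    """Dot product of a sparse vector with a dense weight vector."""
    return sum(v * weights[i] for i, v in zip(idx, val))


def random_vector(n, density, rng):
    """Dense list of length n where each entry is non-zero with probability `density`."""
    return [rng.random() + 0.5 if rng.random() < density else 0.0 for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 10_000
    weights = [rng.random() for _ in range(n)]
    for density in (0.01, 0.1, 0.5, 0.9):
        dense = random_vector(n, density, rng)
        idx, val = to_sparse(dense)
        print(density, dense_bytes(dense), sparse_bytes(idx, val),
              sparse_dot(idx, val, weights))
```

    Under these assumptions (4-byte indices, 8-byte values), the sparse payload is smaller only while the fraction of non-zeros stays below 8/12, i.e. about 0.67. That covers payload size only; the actual break-even point in a benchmark would also depend on the serializer and on the compute cost of the sparse versus dense product.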

