spark-dev mailing list archives

From martinjaggi <>
Subject [GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
Date Tue, 18 Feb 2014 23:30:09 GMT
Github user martinjaggi commented on the pull request:
    Thanks @mengxr for the benchmark efforts! I'm just not sure whether you saw my comment about
part 2) of the benchmark, k-means: in my opinion this algorithm is not very suitable for judging
the sparse vector overhead, since it's currently the only method in MLlib that does *not*
communicate the vectors (only the dense centers). In contrast, all gradient-based methods
need to communicate the sparse vectors in each iteration (each MapReduce round). For these,
serialization can often take about as long as the vector x vector product, which is all the
computation there is. So I'm just saying that both costs matter in practice, but currently we
only benchmark one of the two, right?
    Maybe this has something to do with what @etrain ran into in the early sparse tests? Or do
you guys think this is not an issue? I would be curious to see how the candidates perform on
some of the gradient-based methods, and at which sparsity/load factor the sparse vectors
start beating the dense ones.
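
    To make the break-even question concrete, here is a rough sketch of the serialization side
only. It assumes a hypothetical wire format (8-byte doubles for dense entries, a 4-byte count
plus 4-byte index and 8-byte value per nonzero for sparse ones). This is not how Spark or any
of the candidate libraries actually serialize vectors, and all function names are made up for
illustration.

```python
import struct


def dense_bytes(values):
    """Hypothetical dense format: every entry as an 8-byte little-endian double."""
    return struct.pack(f"<{len(values)}d", *values)


def sparse_bytes(values):
    """Hypothetical sparse format: 4-byte nonzero count, then (int32 index, double value) pairs."""
    nonzeros = [(i, v) for i, v in enumerate(values) if v != 0.0]
    out = struct.pack("<i", len(nonzeros))
    for i, v in nonzeros:
        out += struct.pack("<id", i, v)  # 12 bytes per nonzero, no padding
    return out


def break_even_nnz(n):
    """Largest nonzero count for which the sparse format is smaller than the dense one."""
    # Sparse size is 4 + 12 * nnz, dense size is 8 * n.
    return (8 * n - 4) // 12


def sparse_dot(nonzeros, weights):
    """Dot product of a sparse vector, given as (index, value) pairs, with a dense weight list."""
    return sum(v * weights[i] for i, v in nonzeros)


if __name__ == "__main__":
    n = 100
    x = [0.0] * n
    for i in range(0, n, 10):  # 10 nonzeros, i.e. 10% density
        x[i] = 1.5
    print(len(dense_bytes(x)), len(sparse_bytes(x)), break_even_nnz(n))
```

    Under these assumptions the sparse format only wins on size below roughly 2/3 density. The
runtime break-even will differ and has to be measured on the actual implementations.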
