spark-reviews mailing list archives

From jtengyp <...@git.apache.org>
Subject [GitHub] spark issue #17898: Optimize the CartesianRDD to reduce repeatedly data fetc...
Date Mon, 08 May 2017 09:28:42 GMT
GitHub user jtengyp commented on the issue:

    https://github.com/apache/spark/pull/17898
  
    Here are my test results:
    Environment: 3 workers, each with 10 cores, 30 GB of memory, and 1 executor.
    Test data: 480,189 users and 17,770 items, each a 10-dimensional vector.
    With the default CartesianRDD, the cartesian step takes 2420.7 s.
    With this proposal, it takes 45.3 s,
    roughly 53x faster than the original method (2420.7 s / 45.3 s).
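
    As a quick sanity check on the figures above, here is a minimal Python sketch of the arithmetic. It only restates the numbers quoted in this comment; the helper names (cartesian_pairs, speedup) are hypothetical and are not part of Spark or of this pull request.

```python
# Figures quoted in the comment above; nothing here is measured by this script.
NUM_USERS = 480_189
NUM_ITEMS = 17_770
BASELINE_SECONDS = 2420.7   # default CartesianRDD
PROPOSAL_SECONDS = 45.3     # with the proposed change


def cartesian_pairs(n_left: int, n_right: int) -> int:
    """Number of (left, right) pairs a full cartesian product yields."""
    return n_left * n_right


def speedup(baseline: float, optimized: float) -> float:
    """Ratio of baseline wall-clock time to optimized wall-clock time."""
    return baseline / optimized


pairs = cartesian_pairs(NUM_USERS, NUM_ITEMS)
ratio = speedup(BASELINE_SECONDS, PROPOSAL_SECONDS)

print(f"pairs produced: {pairs:,}")   # pairs produced: 8,532,958,530
print(f"speedup: {ratio:.1f}x")       # speedup: 53.4x
```

    So the job produces about 8.5 billion user-item pairs, and the reported times correspond to a speedup of about 53x.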

