GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19904
[SPARK-22707][ML] Optimize CrossValidator fitting memory occupation by models
## What changes were proposed in this pull request?
Testing shows that CrossValidator still has a memory issue: while fitting, it occupies `O(n*sizeof(model))`
memory to hold models, whereas a well-optimized implementation should need only `O(parallelism*sizeof(model))`.
This is because each future in `modelFutures` keeps a reference to its model object after the future completes
(the model can still be fetched with `future.value.get.get`), and both `Future.sequence` and the `modelFutures`
array hold references to every model future. So all model objects stay referenced until
`fit` returns, and memory use remains `O(n*sizeof(model))`.
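The retention described above can be reproduced outside Spark. The following is a minimal sketch, not code from this patch: `FakeModel`, `RetainedModels`, and `fitAll` are hypothetical names, and the byte array only stands in for a model's memory footprint. It shows that once the futures complete, every model is still reachable through the collection of futures.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for a fitted model; `weights` mimics its memory footprint.
final case class FakeModel(id: Int, weights: Array[Byte])

object RetainedModels {
  // Starts one future per "model", analogous to the modelFutures array.
  def fitAll(n: Int): Seq[Future[FakeModel]] =
    (0 until n).map(i => Future(FakeModel(i, new Array[Byte](1024))))

  // Counts the models that are still reachable after all futures have completed.
  def reachableAfterCompletion(n: Int): Int = {
    val modelFutures = fitAll(n)
    Await.result(Future.sequence(modelFutures), 30.seconds)
    // A completed future keeps its result in `value`, so while `modelFutures`
    // is referenced, none of the n models can be garbage collected.
    modelFutures.count(_.value.exists(_.isSuccess))
  }
}
```

Calling `RetainedModels.reachableAfterCompletion(n)` counts one reachable model per future, which is the `O(n*sizeof(model))` behaviour described above.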
I fix this by merging `modelFuture` and `foldMetricFuture` into a single future, and by using `wait/notify`
to unpersist the training dataset as soon as it is no longer needed.
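As a rough illustration of the idea (not the code in this patch), the sketch below fits and evaluates inside a single future so that only the metric escapes, and signals when all fits have finished so the training data can be released. All names here (`MergedFitAndEvaluate`, `FakeModel`, `fit`, `evaluate`, `foldMetrics`, `unpersistTraining`) are hypothetical, and a `CountDownLatch` is swapped in for the `wait/notify` mechanism mentioned above.

```scala
import java.util.concurrent.CountDownLatch
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object MergedFitAndEvaluate {
  // Hypothetical stand-ins for model fitting and metric evaluation.
  final case class FakeModel(param: Double)
  def fit(param: Double): FakeModel = FakeModel(param)
  def evaluate(model: FakeModel): Double = model.param * 2.0

  def foldMetrics(params: Seq[Double], unpersistTraining: () => Unit): Seq[Double] = {
    // Counts down once per finished fit (assumption: replaces wait/notify).
    val fitsRemaining = new CountDownLatch(params.size)
    val metricFutures = params.map { p =>
      Future {
        // The model is local to this future; the latch is released even if fit fails.
        val model = try fit(p) finally fitsRemaining.countDown()
        // Only the Double is stored in the completed future, not the model.
        evaluate(model)
      }
    }
    // Once every fit has finished, the training data is no longer needed.
    fitsRemaining.await()
    unpersistTraining()
    Await.result(Future.sequence(metricFutures), 30.seconds)
  }
}
```

Because each future's result is a `Double`, the only live models at any moment are those inside futures that are currently running, which is bounded by the parallelism of the execution context.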
## How was this patch tested?
N/A
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark fix_cross_validator_memory_issue
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19904.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19904
----
commit 7725fd8a86dddba6c61c7d053dfa510a114bebb8
Author: WeichenXu <weichen.xu@databricks.com>
Date: 2017-12-05T11:45:42Z
init pr
----