spark-reviews mailing list archives

From MLnick <...@git.apache.org>
Subject [GitHub] spark pull request #18733: [SPARK-21535][ML]Reduce memory requirement for Cr...
Date Thu, 03 Aug 2017 11:55:06 GMT
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18733#discussion_r131121125
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
    @@ -112,16 +112,16 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
           val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
           // multi-model training
           logDebug(s"Train split $splitIndex with multiple sets of parameters.")
    -      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
    -      trainingDataset.unpersist()
           var i = 0
           while (i < numModels) {
    +        val model = est.fit(trainingDataset, epm(i)).asInstanceOf[Model[_]]
             // TODO: duplicate evaluator to take extra params from input
    -        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
    +        val metric = eval.evaluate(model.transform(validationDataset, epm(i)))
             logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
             metrics(i) += metric
             i += 1
           }
    +      trainingDataset.unpersist()
    --- End diff --
    
    One consideration here is that we now unpersist the training data only after all models (for a fold) have been evaluated. This means the full dataset (train and validation) is in cluster memory throughout, whereas previously only one dataset would be in cluster memory at a time. Could the impact of this on cluster resources be greater than the saving on the driver from storing `1` instead of `numModels` models temporarily per fold?
    
    It obviously depends on a lot of factors (dataset size, cluster resources, driver memory, model size, etc.).
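
    To make the trade-off concrete, here is a rough back-of-the-envelope sketch in plain Scala (no Spark involved). It only encodes the two claims above: before the change, at most one dataset is cached at a time while the driver holds `numModels` models; after the change, both datasets are cached while the driver holds one model. The names `FoldMemory`, `Sizes`, and `Peak` are hypothetical, and the model ignores everything else that affects real memory use.

```scala
// Hypothetical helper for illustration only; not part of Spark or of this pull request.
object FoldMemory {
  // All sizes are in MB and are made-up inputs, not measurements.
  final case class Sizes(trainMb: Long, validationMb: Long, modelMb: Long, numModels: Int)

  // Peak cached data on the cluster and peak model storage on the driver, per fold.
  final case class Peak(clusterCacheMb: Long, driverModelsMb: Long)

  // Before the change: training data is unpersisted before evaluation starts,
  // so only one dataset is assumed cached at a time; all models are held at once.
  def before(s: Sizes): Peak =
    Peak(math.max(s.trainMb, s.validationMb), s.modelMb * s.numModels)

  // After the change: training data stays cached until every model is evaluated,
  // so both datasets are assumed cached together; only one model is held at a time.
  def after(s: Sizes): Peak =
    Peak(s.trainMb + s.validationMb, s.modelMb)
}
```

    With made-up inputs such as `FoldMemory.Sizes(9000, 1000, 50, 20)`, `before` gives 9000 MB cached and 1000 MB of models, while `after` gives 10000 MB cached and 50 MB of models. Whether that exchange is worthwhile depends on the factors listed above.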

