spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <>
Subject [jira] [Updated] (SPARK-16008) ML Logistic Regression aggregator serializes unnecessary data
Date Fri, 17 Jun 2016 09:48:05 GMT


Nick Pentreath updated SPARK-16008:
    Assignee: Seth Hendrickson

> ML Logistic Regression aggregator serializes unnecessary data
> -------------------------------------------------------------
>                 Key: SPARK-16008
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Seth Hendrickson
>            Assignee: Seth Hendrickson
> The LogisticRegressionAggregator class collects gradient updates in the ML logistic
regression algorithm. It stores a reference to the coefficients array, whose length equals
the number of features, and a reference to an array of standard deviations, which also has
length numFeatures. When a task completes, the aggregator is serialized, and a copy of both
arrays is serialized along with it. These arrays do not need to be serialized, because only
the gradient updates are being aggregated. The extra data causes performance issues when the
number of features is large, and it can trigger excess garbage collection when the executor
doesn't have much excess memory.
> As a result, each serialized aggregator carries 2 * numFeatures values that are never used.
Once multiclass logistic regression is implemented, the excess will be
numFeatures + numClasses * numFeatures values.
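
The effect described in the issue can be reproduced outside Spark with a small toy. The sketch below is written in Python with pickle, not Spark's Scala and its JVM serializers, and every name in it (GradientAggregator, aggregated_state, excess_values) is hypothetical and chosen for illustration only. It assumes the aggregator holds two read-only arrays of length numFeatures plus a gradient buffer, and compares the size of serializing every field against serializing only the aggregated fields. It is not the change that was made in Spark.

```python
import math
import pickle
from array import array


def excess_values(num_features, num_classes=None):
    """Count of unneeded values per serialized aggregator, per the issue text."""
    if num_classes is None:
        # Binary case: coefficients + standard deviations.
        return 2 * num_features
    # Multiclass case as stated in the issue.
    return num_features + num_classes * num_features


class GradientAggregator:
    """Hypothetical stand-in for an aggregator that collects gradient updates."""

    def __init__(self, coefficients, features_std):
        # Read-only inputs: needed to compute updates, never modified here.
        self.coefficients = coefficients
        self.features_std = features_std
        # The only state that has to be combined across tasks.
        self.gradient_sum = array("d", [0.0] * len(coefficients))
        self.count = 0

    def add(self, label, features):
        """Accumulate the logistic-loss gradient for one example."""
        scaled = [x / s for x, s in zip(features, self.features_std)]
        margin = sum(c * x for c, x in zip(self.coefficients, scaled))
        multiplier = 1.0 / (1.0 + math.exp(-margin)) - label
        for i, x in enumerate(scaled):
            self.gradient_sum[i] += multiplier * x
        self.count += 1

    def aggregated_state(self):
        """Only the fields that are actually aggregated."""
        return {"gradient_sum": self.gradient_sum, "count": self.count}


num_features = 10_000
agg = GradientAggregator(
    coefficients=array("d", [0.0001] * num_features),
    features_std=array("d", [1.0] * num_features),
)
agg.add(1.0, [1.0] * num_features)

# Pickling every field mimics serializing the whole object, inputs included.
full_bytes = pickle.dumps(vars(agg))
# Pickling only the aggregated fields leaves the read-only arrays behind.
lean_bytes = pickle.dumps(agg.aggregated_state())

print("all fields:      ", len(full_bytes), "bytes")
print("aggregated only: ", len(lean_bytes), "bytes")
print("excess values:   ", excess_values(num_features))
```

With 8-byte doubles, the difference between the two payloads is at least 2 * numFeatures * 8 bytes, which matches the 2 * numFeatures figure quoted in the issue.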

This message was sent by Atlassian JIRA
