Date: Fri, 17 Jun 2016 09:48:05 +0000 (UTC)
From: "Nick Pentreath (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-16008) ML Logistic Regression aggregator serializes unnecessary data
Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
archived-at: Fri, 17 Jun 2016 09:48:09 -0000

     [ https://issues.apache.org/jira/browse/SPARK-16008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Pentreath updated SPARK-16008:
-----------------------------------
    Assignee: Seth Hendrickson

> ML Logistic Regression aggregator serializes unnecessary data
> -------------------------------------------------------------
>
>                 Key: SPARK-16008
>                 URL: https://issues.apache.org/jira/browse/SPARK-16008
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Seth Hendrickson
>            Assignee: Seth Hendrickson
>
> The LogisticRegressionAggregator class is used to collect gradient updates in the ML logistic regression algorithm. The class stores a reference to the coefficients array, whose length equals the number of features. It also stores a reference to an array of standard deviations, which is also of length numFeatures. When a task completes, it serializes the class, which also serializes a copy of the two arrays. These arrays don't need to be serialized (only the gradient updates are being aggregated). This causes performance issues when the number of features is large and can trigger excess garbage collection when the executor doesn't have much spare memory.
> This results in serializing 2*numFeatures excess values. When multiclass logistic regression is implemented, the excess will be numFeatures + numClasses * numFeatures.
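
The effect described in the issue can be illustrated with a small sketch. Spark runs on the JVM, so this is only an analogy: Python's `pickle` stands in for the task-result serializer, and the class names `LeakyAggregator` and `LeanAggregator`, along with both helper functions, are hypothetical names invented for this example. It is not the change that was made in Spark. It only shows how holding references to read-only inputs inflates the serialized aggregator, and it reproduces the excess counts quoted above.

```python
import pickle
from array import array


class LeakyAggregator:
    """Holds references to read-only inputs, so serializing it copies them too."""

    def __init__(self, coefficients, features_std):
        self.coefficients = coefficients  # read-only input, length numFeatures
        self.features_std = features_std  # read-only input, length numFeatures
        # The only state that actually needs to travel back to the driver.
        self.gradient_sum = array("d", [0.0] * len(coefficients))


class LeanAggregator(LeakyAggregator):
    """Same in-memory state, but only the gradient buffer is serialized."""

    def __getstate__(self):
        # Leave the two read-only arrays out of the pickled state.
        return {"gradient_sum": self.gradient_sum}


def serialized_size(obj):
    """Number of bytes pickle produces for obj."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def excess_values(num_features, num_classes=None):
    """Excess values per serialized aggregator, using the counts from the issue."""
    if num_classes is None:
        return 2 * num_features  # coefficients + standard deviations
    return num_features + num_classes * num_features  # multiclass case


num_features = 100_000
coefficients = array("d", [0.1] * num_features)
features_std = array("d", [1.0] * num_features)

leaky_bytes = serialized_size(LeakyAggregator(coefficients, features_std))
lean_bytes = serialized_size(LeanAggregator(coefficients, features_std))
print("leaky aggregator bytes:", leaky_bytes)
print("lean aggregator bytes:", lean_bytes)
print("excess values (binary):", excess_values(num_features))
print("excess values (10 classes):", excess_values(num_features, num_classes=10))
```

On the JVM, a comparable effect is usually obtained by keeping such inputs out of the serialized object, for example by marking fields `transient` or by passing the arrays into the update method rather than storing them. Which approach suits Spark's aggregator is a design decision outside the scope of this sketch.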